www.guidetopharmacology.org
Looking at the gift horse: pros and cons of patent-extracted structures in PubChem
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative
Physiology, University of Edinburgh. ICIC Heidelberg, Monday 23rd Oct 2017
https://www.slideshare.net/secret/v4A5eUTuYvT28X
22 million
Abstract (will be skipped for the presentation)
As of August 2017, the major automated patent chemistry extractions (in ascending size,
NextMove, SCRIPDB, IBM and SureChEMBL) are submitters for 21.5 million CIDs out of the
PubChem total of 93.8 million. The following aspects will be expanded in this presentation,
starting with advantages: a) while the relative coverage between open and commercial sources is
difficult to determine (PMID 26457120), it is clear that the majority of patent-exemplified
structures of medicinal chemistry interest (i.e. from IPC classes C07 plus A61) are now in
PubChem; b) this allows most first filings of lead series and clinical candidates to be tracked;
c) the PubChem toolbox has query, analysis, clustering and linking features that are difficult to
match in commercial sources; d) many structures can be associated with bioactivity data;
e) connections between manually curated papers and patents can be made via the 0.48 million
CID intersects with ChEMBL. However, looking more closely also indicates disadvantages:
a) extraction coverage is compromised by dense image tables and poor OCR quality of WO
documents; b) SureChEMBL is the only major open pipeline running continuously in situ, but it
has a PubChem updating lag; c) automated extraction generates structural “noise” that degrades
chemistry quality; d) PubChem patent document metadata indexing is patchy (although better
for SureChEMBL in situ); e) nothing in the records indicates IP status; f) continual re-extraction
of common chemistry results in over-mapping (e.g. 126,949 patents for aspirin and 14,294 for
atorvastatin); g) authentic compounds are contaminated with spurious mixtures and never-made
virtuals, including thousands of deuterated drugs; h) linking between assay data and targets is
still a manual exercise. Nevertheless, all things considered, the PubChem patent “big bang”
presents users with the best of both worlds (PMID 26194581). Academics or smaller enterprises
who cannot afford commercial solutions can now patent mine extensively. Even for those with
commercial subscriptions, PubChem has become an essential complementary source for the
analysis of patent chemistry and associated bio entities such as diseases and drug targets.
Outline
• History of patent chemistry feeds to PubChem
• Relative source contributions
• Caveats with automated extraction
• Source intersects
• Fragmentation
• Source extraction comparisons
• Circularity for virtuals
• Mixtures
• Lag times
• Conclusions
• References
• Workshop alert
Chemical Named Entity Recognition (CNER)
• Automated process: documents in > structures out
• SureChEMBL pipeline shown above; other sources are similar
• Name-to-struc (n2s) by look-up and/or IUPAC translation, image-to-struc (i2s),
and mol files from USPTO Complex Work Units (CWUs)
• Indexing by document section is usually added, e.g. abstract, description, claims
• As well as patents, IBM runs CNER on PubMed abstracts and PMC
History of patent chemistry feeds into PubChem
• 2006 Thomson (now Clarivate) Pharma, manual extractions from patents
and papers, 4.3 million (but ceased Jan 2016)
• 2011 IBM phase 1 Chemical Named Entity Recognition (CNER), 2.5 million
• SLING Consortium EPO extraction, 0.1 million
• 2012 SCRIPDB, CNER + Complex Work Units (CWU), 4.0 million
• 2013 SureChem, CNER + image, 9.0 million
• 2014 BindingDB, manual activity curation, 0.13 million
• 2015 (CNER + images + CWU)
  • SureChEMBL, 13.0 million
  • IBM phase 2, 7.0 million
  • NextMove Software, 1.4 million, synthesis mapping
• 2016 SureChEMBL, 15.8 million
• 2017 IBM phase 3, 6.0 million
2011 “fizzle” > 2015 “big bang”
Pro: Oct 2017, from 93.89 million PubChem CIDs
Pro: PubChem indexes IPC splits
Con: document indexing is USPTO-dominated (i.e. early WOs missed)
Con: Entrez can't handle the joins
Con: Mw plots reveal CNER fragmentation
ChEMBL + Thomson Pharma (manual extraction) = 5.6 million
Patent CNER = 21.8 million
Con: those “chessboardanes” are still hanging around ...
Pros & cons arising from intersects and filters
Intersects and diffs for major CNER sources
Pro: corroboration, Con: divergence
Venn diagram of CIDs by source. Counts (Oct 2017) are CIDs in millions.
Source totals: IBM = 10.7, SCRIPDB = 4.0, SureChEMBL = 17.6
Segment values as shown: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50
Union = 21.7
3-way = 2.4
3 + 2-way = 8.1
Unique = 13.5
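The summary figures for the Venn counts above can be cross-checked with a little arithmetic. The sketch below is only a consistency check: the slide does not say which segment value belongs to which region, so the assignment used here is an inference chosen because it reproduces the stated totals. With it, Unique (13.5), 3 + 2-way (8.1) and SureChEMBL (17.6) match exactly, while Union, IBM and SCRIPDB come out 0.1 lower than the slide, presumably from rounding of the individual segments.

```python
# Region sizes in tenths of a million CIDs (integers avoid float drift).
# ASSUMPTION: the mapping of each value to a region is inferred, not stated on the slide.
REGIONS_TENTHS = {
    frozenset({"IBM"}): 29,
    frozenset({"SCRIPDB"}): 5,
    frozenset({"SureChEMBL"}): 101,
    frozenset({"IBM", "SureChEMBL"}): 47,
    frozenset({"IBM", "SCRIPDB"}): 6,
    frozenset({"SCRIPDB", "SureChEMBL"}): 4,
    frozenset({"IBM", "SCRIPDB", "SureChEMBL"}): 24,
}


def total(predicate):
    """Sum the regions whose member set satisfies predicate, in millions."""
    return sum(n for members, n in REGIONS_TENTHS.items() if predicate(members)) / 10


unique = total(lambda members: len(members) == 1)      # CIDs from one source only
shared = total(lambda members: len(members) >= 2)      # the "3 + 2-way" figure
three_way = total(lambda members: len(members) == 3)   # CIDs found by all three
union = total(lambda members: True)                    # every region added up
per_source = {
    name: total(lambda members, name=name: name in members)
    for name in ("IBM", "SCRIPDB", "SureChEMBL")
}

print(unique, shared, three_way, union, per_source)
```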
Con: circular extraction of virtual enumerations
• 1511 codeine records, mainly 563 deuterated forms from Auspex US7872013 > 3-source multiplexing
• 652 InChIKey inner layer records via 266 stereoisomers of vorapaxar from Schering
US20080085923 > 4-source multiplexing in UniChem
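Record counts like those above come from grouping structures that share the same connectivity. A minimal sketch follows, assuming the "inner layer" refers to the first 14-character block of the standard InChIKey, which hashes the connectivity so that stereoisomers and isotopically labelled (e.g. deuterated) forms of one skeleton share it. The keys in the example are made-up placeholders, not real compounds.

```python
from collections import defaultdict


def group_by_first_block(inchikeys):
    """Group full InChIKeys by their first (connectivity) block."""
    groups = defaultdict(list)
    for key in inchikeys:
        first_block = key.split("-")[0]  # 14 characters before the first hyphen
        groups[first_block].append(key)
    return dict(groups)


# Placeholder keys for illustration only (hypothetical, not real structures).
keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",  # parent form
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",  # same skeleton, different stereo/isotope layer
    "DDDDDDDDDDDDDD-UHFFFAOYSA-N",  # unrelated skeleton
]

groups = group_by_first_block(keys)
for block, members in groups.items():
    print(block, len(members))
```

A first block with an unusually large number of members is a hint that enumerated stereoisomers or deuterated virtuals have been extracted for that skeleton.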
Pro: good coverage, Con: not complete
• Senger et al. (2015) compared SureChEMBL and IBM with SciFinder and Reaxys for a small
patent set (i.e. open vs commercial)
• They concluded: “50–66 % of the relevant content from the latter was also
found in the former”
• Equivalent comparisons in the latest PubChem would record a higher overlap
• The probability of completely missing a recently exemplified series is
getting lower
Managing expectations: assessment of chemistry databases generated by
automated extraction of chemical structures from patents. Senger et al., J.
Cheminf. 2015, 7:49, doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)
http://www.ncbi.nlm.nih.gov/pubmed/26457120
Examining extraction selectivity for the same patent
Coverage from US9181236
Pro: convergence, Con: divergence
• 173 BindingDB CIDs curated from PubChem via US9181236
• 405 substances as SDF from SciFinder > OpenBabel > 391 InChIKeys (IK) > 362 CIDs
• 1657 rows > 834 SureChEMBL IDs > 664 CIDs
• 3-way Venn of CIDs
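Once each source has been reduced to a list of CIDs, the 3-way Venn is plain set arithmetic. The sketch below shows one way to compute the seven regions; the function name and the toy CID values are illustrative assumptions, not the actual CIDs from US9181236.

```python
def venn3(a, b, c):
    """Return the seven disjoint regions of a 3-way Venn as sets."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "a_b_only": (a & b) - c,
        "a_c_only": (a & c) - b,
        "b_c_only": (b & c) - a,
        "all_three": a & b & c,
    }


# Toy CIDs for illustration only (hypothetical values).
bindingdb = {1, 2, 3, 4}
scifinder = {3, 4, 5, 6}
surechembl = {4, 6, 7}

regions = venn3(bindingdb, scifinder, surechembl)
for name, members in regions.items():
    print(name, sorted(members))
```

The "all_three" region is the corroborated core (convergence), while the single-source regions are where selective extraction or extraction errors (divergence) are worth inspecting by hand.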
Con: the common chemistry problem
Spurious patent < > compound indexing (patents per compound): aspirin = 131,410,
atorvastatin = 14,968, ethanol = 72,027
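One pragmatic way to handle the common chemistry problem is to set aside compounds mapped to an implausibly large number of patents before mining. The sketch below uses the three counts from the slide plus one made-up entry; the cut-off of 10,000 and the name "example_lead" are arbitrary assumptions for illustration, not recommended values.

```python
# Patents per compound: first three values are from the slide,
# "example_lead" is a hypothetical compound added for contrast.
patent_counts = {
    "aspirin": 131_410,
    "atorvastatin": 14_968,
    "ethanol": 72_027,
    "example_lead": 3,
}


def split_by_patent_count(counts, max_patents):
    """Separate compounds at or below the cut-off from over-mapped ones."""
    specific = {name: n for name, n in counts.items() if n <= max_patents}
    over_mapped = {name: n for name, n in counts.items() if n > max_patents}
    return specific, over_mapped


# ASSUMPTION: 10,000 is an arbitrary threshold chosen for this example.
specific, over_mapped = split_by_patent_count(patent_counts, max_patents=10_000)
print("kept:", specific)
print("set aside:", over_mapped)
```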
Con: the mixtures problem
Con: no open automated SAR extraction
Pro: DIY manual SAR extraction aligned to PubChem structures
Pro: ~2K patents have target-mapped BindingDB curated SAR
• SAR table from WO2016096979, Janssen BACE1 inhibitors
• Left to right: page from the PDF, SureChEMBL mark-up and Excel paste-across
Con: lag in SureChEMBL > PubChem sync times
• Internal UniChem load at EBI, 10 Oct = 18,691,416
• PubChem submission, 07 Oct = 17,687,607
• Latest in situ entries below for 12 Oct
• Extraction in SureChEMBL within a week or less of pub date
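The size of the lag can be read off the two counts above. This is a rough worked calculation, assuming both figures count the same kind of record (SureChEMBL structures), so that their difference approximates the backlog not yet visible in PubChem.

```python
unichem_load = 18_691_416        # internal UniChem load at EBI, 10 Oct
pubchem_submission = 17_687_607  # PubChem submission, 07 Oct

# Structures present in the EBI load but not yet in the PubChem submission.
backlog = unichem_load - pubchem_submission
backlog_share = backlog / unichem_load

print(f"{backlog:,} structures ({backlog_share:.1%} of the UniChem load)")
```

On these numbers the gap is about one million structures, a little over 5 % of the internal load.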
Con: IBM CNER > 80% of all PubChem < > PMID links
• IBM extracts from PubMed abstracts as well as patents
• This creates PubChem structure < > PMID links
• Automated associations swamp out expert-curated assignments
• Specificity/accuracy is equivocal
Conclusions
• For the PubChem patent chemistry “Big Bang” the pros massively outweigh
the cons (i.e. it’s not a bad horse …)
• Contributors are to be congratulated, as is PubChem for wrangling them
• However, it is important to look closely at the gift horse ...
• Users need to understand CNER quirks, pitfalls and confounding artefacts
• PubChem slicing and filtering can partially ameliorate these
• Activity-to-target mapping for SAR extraction is still a pinch point
• Open extraction is a crucial comparator for commercial efforts
• Those without commercial sources are well enabled for patent mining
• Those with commercial sources can synergise with open searching
Info
http://cdsouthan.blogspot.com/ many posts have the tag “patents”
http://www.ncbi.nlm.nih.gov/pubmed/26194581
http://www.guidetopharmacology.org/
http://www.sciencedirect.com/science/article/pii/B9780124095472138144
Questions? (but wait …. there’s more, a Tuesday tutorial)
AI-SDV 2022: Copyright Clearance Center
 
AI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IPAI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IP
 
AI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOCAI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOC
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
 
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
 

Dernier

Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
anilsa9823
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
ellan12
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
Diya Sharma
 
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 

Dernier (20)

𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
INDIVIDUAL ASSIGNMENT #3 CBG, PRESENTATION.
INDIVIDUAL ASSIGNMENT #3 CBG, PRESENTATION.INDIVIDUAL ASSIGNMENT #3 CBG, PRESENTATION.
INDIVIDUAL ASSIGNMENT #3 CBG, PRESENTATION.
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night StandHot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
 
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
 
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
Russian Call Girls in %(+971524965298  )#  Call Girls in DubaiRussian Call Girls in %(+971524965298  )#  Call Girls in Dubai
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
 

ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent-extracted structures in PubChem

  • 1. www.guidetopharmacology.org Looking at the gift horse: pros and cons of patent-extracted structures in PubChem. Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh. ICIC Heidelberg, Monday 23rd Oct 2017 https://www.slideshare.net/secret/v4A5eUTuYvT28X 22 million
  • 2. Abstract (will be skipped for the presentation) As of August 2017, the major automated patent chemistry extractions (in ascending size: NextMove, SCRIPDB, IBM and SureChEMBL) are included as submitters for 21.5 million CIDs from the PubChem total of 93.8 million. The following aspects will be expanded in this presentation, starting with advantages: a) while the relative coverage between open and commercial sources is difficult to determine (PMID 26457120), it is clear that the majority of patent-exemplified structures of medicinal chemistry interest (i.e. from C07 plus A61) are now in PubChem; b) this allows most first filings of lead series and clinical candidates to be tracked; c) the PubChem toolbox has query, analysis, clustering and linking features difficult to match in commercial sources; d) many structures can be associated with bioactivity data; e) connections between manually curated papers and patents can be made via the 0.48 million CID intersects with ChEMBL. However, looking more closely also indicates disadvantages: a) extraction coverage is compromised by dense image tables and poor OCR quality of WO documents; b) SureChEMBL is the only major open pipeline continuously running in situ but has a PubChem updating lag; c) automated extraction generates structural “noise” that degrades chemistry quality; d) PubChem patent document metadata indexing is patchy (although better for SureChEMBL in situ); e) nothing in the records indicates IP status; f) continual re-extraction of common chemistry results in over-mapping (e.g. 126,949 patents for aspirin and 14,294 for atorvastatin); g) authentic compounds are contaminated with spurious mixtures and never-made virtuals, including thousands of deuterated drugs; h) linking between assay data and targets is still a manual exercise. However, all things considered, the PubChem patent “big bang” presents users with the best of both worlds (PMID 26194581). Academics or smaller enterprises who cannot afford commercial solutions can now patent mine extensively. Even for those with commercial subscriptions, PubChem has become an essential adjunct/complementary source for the analysis of patent chemistry and associated bio entities such as diseases and drug targets.
  • 3. Outline • History of patent chemistry feeds to PubChem • Relative source contributions • Caveats with automated extraction • Source intersects • Fragmentation • Source extraction comparisons • Circularity for virtuals • Mixtures • Lag times • Conclusions • References • Workshop alert
  • 4. Chemical Named Entity Recognition (CNER) • Automated process of documents in > structures out • SureChEMBL pipeline shown above, other sources similar • Name-to-struc (n2s) by look-up and/or IUPAC translation, image-to-struc (i2s) and mol files from USPTO Complex Work Units (CWUs) • Indexing usually added, e.g. abstract, descriptions, claims • As well as patents, IBM run PubMed abstracts and PMC
  • 5. History of patent chemistry feeds into PubChem • 2006 Thomson (now Clarivate) Pharma, manual extractions from patents and papers, 4.3 mil (but ceased Jan 2016) • 2011 IBM phase 1 Chemical Named Entity Recognition (CNER) 2.5 mil • SLING Consortium EPO extraction 0.1 mil • 2012 SCRIPDB, CNER + Complex Work Units (CWU) 4.0 mil • 2013 SureChem, CNER + image, 9.0 mil • 2014 BindingDB manual activity curation 0.13 mil • 2015 (CNER + images + CWU) • SureChEMBL 13.0 mil • IBM phase 2, 7.0 mil • NextMove Software 1.4 mil synthesis mapping • 2016 SureChEMBL 15.8 mil • 2017 IBM phase 3, 6.0 mil
  • 6. 2011 “fizzle” > 2015 “big bang”
  • 7. Pro: Oct 2017, from 93.89 million PubChem CIDs
  • 8. Pro: PubChem indexes IPC splits Con: document indexing is USPTO dominated (i.e. early WOs missed) Con: Entrez can't handle the joins
  • 9. Con: Mw plots reveal CNER fragmentation • ChEMBL + Thomson Pharma = 5.6 million (manual extraction) • Patent CNER = 21.8 million
  • 10. Con: those “Chessbordanes” still hanging around…
  • 11. Pros & cons arising from intersects and filters
  • 12. Intersects and diffs for major CNER sources Pro: corroboration, Con: divergence • Counts (Oct 2017) are CIDs in millions • IBM = 10.7, SCRIPDB = 4.0, SureChEMBL = 17.6 • Venn segment values: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50 • Union = 21.7 • 3-way = 2.4 • 3 + 2-way = 8.1 • Unique = 13.5
  • 13. Con: circular extraction of virtual enumerations • 1511 codeine records, mainly 563 deuterations from Auspex US7872013 > 3-source multiplexing • 652 InChIKey inner layer records via 266 stereos of vorapaxar via Schering US20080085923 > 4-source multiplexing in UniChem
  • 14. Pro: good coverage, Con: not complete • Compared SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial) • Concluded: “50–66 % of the relevant content from the latter was also found in the former” • Equivalent comparisons in the latest PubChem would record a higher overlap • The probability of completely missing a recently exemplified series is getting lower • Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents, Senger, et al. J. Cheminf. 2015, 7:49 doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL) http://www.ncbi.nlm.nih.gov/pubmed/26457120
  • 16. Coverage from US9181236 Pro: convergence, Con: divergence • 173 BindingDB CIDs curated from PubChem via US9181236 • 405 substances SDF from SciFinder OpenBabel > 391 IK > 362 CIDs • 1657 rows > 834 SureChEMBL IDs > 664 CIDs • 3-way Venn of CIDs
  • 17. Con: the common chemistry problem • Spurious patent < > cpd indexing: aspirin = 131,410, atorvastatin = 14,968, ethanol = 72,027
  • 18. Con: the mixtures problem
  • 19. Con: no open automated SAR extraction Pro: DIY manual SAR extraction aligned to PubChem structures Pro: ~2K patents have target-mapped BindingDB curated SAR • SAR table from WO2016096979, Janssen BACE1 inhibitors • Left to right: page from the PDF, SureChEMBL mark-up and Excel paste-across
  • 20. Con: Lag in SureChEMBL > PubChem synch times • Internal UniChem load at EBI, 10 Oct = 18,691,416 • PubChem submission, 07 Oct = 17,687,607 • Latest in situ entries below for 12 Oct • Extraction in SureChEMBL within a week or less of pub date
  • 21. Con: IBM CNER accounts for > 80% of all PubChem < > PMID links • IBM extracts PubMed abstracts as well as patents • PubChem < > structures to PMID • Automated associations swamp out expert-curated assignments • Specificity/accuracy is equivocal
  • 22. Conclusions • For the PubChem patent chemistry “Big Bang” the pros massively outweigh the cons (i.e. it’s not a bad horse …) • Contributors are to be congratulated, as is PubChem for wrangling them • However, it is important to look closely at the gift horse… • Users need to understand CNER quirks, pitfalls and confounding artefacts • PubChem slicing and filtering can partially ameliorate these • Activity-to-target mapping for SAR extraction is still a pinch point • Open extraction is a crucial comparator for commercial efforts • Those without commercial sources are well enabled for patent mining • Those with commercial sources can synergise with open searching
  • 23. Info • http://cdsouthan.blogspot.com/ (many posts have the tag “patents”) • http://www.ncbi.nlm.nih.gov/pubmed/26194581 • http://www.guidetopharmacology.org/ • http://www.sciencedirect.com/science/article/pii/B9780124095472138144
  • 24. Questions? (but wait … there’s more, a Tuesday tutorial)
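
The source intersects on slides 12 and 16 and the "inner layer" grouping on slide 13 can be reproduced on a small scale with plain set operations. The following is only a minimal sketch, not the method used for the slides: the function names, the toy CID sets and the second InChIKey (marked as a placeholder) are assumptions made for illustration. It relies on the first block of an InChIKey encoding connectivity only, so stereo and isotopic variants of one skeleton share that block.

```python
from collections import Counter, defaultdict


def venn_regions(sources):
    """Map each exact combination of source names to the CIDs found in exactly those sources."""
    membership = defaultdict(set)
    for name, cids in sources.items():
        for cid in cids:
            membership[cid].add(name)
    regions = defaultdict(set)
    for cid, names in membership.items():
        regions[frozenset(names)].add(cid)
    return dict(regions)


def overlap_summary(regions):
    """Count CIDs by how many sources contain them (1 = unique, 2 = 2-way, 3 = 3-way)."""
    counts = Counter()
    for names, cids in regions.items():
        counts[len(names)] += len(cids)
    return dict(counts)


def skeleton_groups(inchikeys):
    """Group InChIKeys by their first (connectivity) block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[key.split("-")[0]].append(key)
    return dict(groups)


# Toy CID sets (hypothetical values, not real submitter content).
toy_sources = {
    "IBM": {1, 2, 3, 4},
    "SCRIPDB": {3, 4, 5},
    "SureChEMBL": {2, 3, 6, 7},
}
regions = venn_regions(toy_sources)
summary = overlap_summary(regions)
print(sorted(summary.items()))  # [(1, 4), (2, 2), (3, 1)]

toy_keys = [
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin
    "BSYNRYMUTXBXSQ-AAAAAAAASA-N",  # placeholder second block, stands in for an isotopic variant
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol
]
groups = skeleton_groups(toy_keys)
print({block: len(keys) for block, keys in groups.items()})
```

With real data, the input sets would be CID lists downloaded per PubChem submitter. The per-count summary corresponds to the "Unique", "3 + 2-way" and "3-way" figures on slide 12, and large groups sharing one first block flag the kind of enumerated deuterations and stereoisomers described on slide 13.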