This document discusses integrating patent chemistry data from SureChem into public research resources such as PubChem. SureChem has automatically extracted 12.8 million unique chemical structures from 20 million annotated records covering US, EP and WO full-text patents and Japanese patent abstracts, with MEDLINE abstracts to follow. It plans to deposit these structures into PubChem in Q4 2012, after filtering out fragments and highly common chemistry, so that this patent chemistry becomes publicly searchable. The deposition would significantly expand the scope of patent chemistry in PubChem and advance the goal of a more open, federated chemical information network.
1. Integrating patent chemistry with public research resources
Andrew Hinton, PhD
Christopher Southan, PhD
Evan Bolton, PhD
Nicko Goncharoff
ICIC 2012, 17 October
3. SureChem Data Collection
Database of structure data automatically mined from text and images
•20M annotated US, EP, WO full text records and Japan patent abstracts
•12.8M unique chemical structures
•MEDLINE – 19M abstracts (upcoming)
4. Free resource for researchers
•Enables linking to public and proprietary content
Professional search needs
•Data export, alerts, patent family search, chemical relevance filters…
API or Data Feed access to chemistry & full text
•Integrate with internal databases & workflows
7. SureChem Depositing All* Structures into PubChem – Q4 2012
•1976 to present
•Deposition of structures only
•Currently ‘on hold’
•Will link to patents in SureChemOpen
* After filtering of fragments and highly common chemistry
8. Compounds Derived from Patents and Literature found in PubChem, Banded by Molecular Weight Range (MWT) and Source
[Figure: stacked bar chart. X axis: Source (ChEMBL, IBM, Thomson Pharma, SCRIPDB, SureChem). Y axis: Compounds in PubChem (0 to 9,000,000). Bands: MWT 100-200, 200-300, 300-400, 400-500, 500-600, 600-700. Bar labels, not matched to sources here: totals of 8.29M*, 3.99M, 3.80M, 2.36M and 0.76M; drug-like shares of 66%, 60%, 62%, 51% and 69%.]
*Provisional Numbers
16. SureChem Unique Contribution
[Venn diagram: SureChem only, 79; overlap with PubChem, 96 (Thomson Pharma, Chemicalize)]
Stage | No. of Structures
Available from SureChem (SC) | 1848
Pre-Exist in PubChem | 669
Pre-Exist – not from IC50 table | 573
Pre-Exist – from IC50 table | 96 (12 from TP + 84 via chemicalize.org)
Unique-SC with IC50 | 79
Unique-SC – beyond IC50 table | 1100
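The counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers from the table; the variable names are illustrative and the derived fractions are computed here rather than stated on the slide.

```python
# Counts copied from the slide's table (variable names are illustrative).
available_from_surechem = 1848
pre_exist_in_pubchem = 669
pre_exist_not_ic50 = 573
pre_exist_ic50 = 96            # 12 from Thomson Pharma + 84 via chemicalize.org
unique_sc_ic50 = 79
unique_sc_beyond_ic50 = 1100

# Structures found only in SureChem for this example.
unique_total = unique_sc_ic50 + unique_sc_beyond_ic50

# Each check confirms that a row is the sum of the rows it breaks down into.
checks = {
    "pre-exist split": pre_exist_not_ic50 + pre_exist_ic50 == pre_exist_in_pubchem,
    "IC50 overlap sources": 12 + 84 == pre_exist_ic50,
    "overall total": pre_exist_in_pubchem + unique_total == available_from_surechem,
}

# Derived shares (not on the slide): unique fraction overall and within the IC50 table.
unique_fraction = unique_total / available_from_surechem
ic50_unique_fraction = unique_sc_ic50 / (unique_sc_ic50 + pre_exist_ic50)

for name, ok in checks.items():
    print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
print(f"unique to SureChem: {unique_total} ({unique_fraction:.1%})")
print(f"IC50-table structures unique to SureChem: {ic50_unique_fraction:.1%}")
```

On these figures the rows are internally consistent, and a little under two thirds of the structures available from SureChem were not already in PubChem.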
17. SureChem Chemical Relevance Filtering
• Frequency counts of chemicals within patents
• Additional molecular property filtering and structural alerts
• Structural identification of “Likely Exemplars”
• Natural Language Processing-based indexing of Exemplified Compounds
Automated indexing of Exemplified Compounds in text
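The first filter listed above, frequency counts of chemicals within patents, can be illustrated with a minimal sketch. This is not SureChem's implementation: the function names, the `max_patents` threshold, the patent identifiers and the structure labels are all hypothetical, and a real pipeline would key on canonical structure identifiers rather than names.

```python
from collections import Counter


def patent_frequency(patent_structures):
    """Count how many distinct patents mention each structure."""
    counts = Counter()
    for structures in patent_structures.values():
        counts.update(set(structures))  # count a structure once per patent
    return counts


def filter_common(patent_structures, max_patents):
    """Drop structures that occur in more than max_patents patents.

    max_patents is an assumed, illustrative threshold for "highly common
    chemistry" such as solvents and reagents.
    """
    counts = patent_frequency(patent_structures)
    return {
        patent_id: [s for s in structures if counts[s] <= max_patents]
        for patent_id, structures in patent_structures.items()
    }


# Hypothetical input: patent identifier -> structures extracted from it.
example = {
    "US-0000001": ["ethanol", "compound-A", "compound-B"],
    "EP-0000002": ["ethanol", "compound-C"],
    "WO-0000003": ["ethanol", "compound-A", "water"],
}

filtered = filter_common(example, max_patents=2)
print(filtered)
```

In this toy input "ethanol" appears in all three patents and is removed, while structures seen in one or two patents are kept. The property filters, structural alerts and exemplar detection named on the slide would be applied as further steps.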
18. Conclusions
SureChem deposition into PubChem:
– Significantly expands public patent chemistry scope
– Contributes unique and timely MedChem-relevant data
– Enables open drug discovery and chemical biology
– Advances progress toward a more open, federated chemical information network