My presentation for the Drug Repurposing workshop at the upcoming Bio-IT World Expo.
http://www.bio-itworldexpo.com/Bio-It_Expo_Content.aspx?id=124256
Presentation abstract:
PubChem has a wealth of chemical structure and biological activity information. In conjunction with NCBI’s other resources such as PubMed and GenBank, PubChem is a vast source of information relevant to repurposing not only established drugs but also any compounds with in vivo pharmacology and/or clinical results. The challenge is how to take advantage of this knowledge. The ability to explore not only chemical similarity but also relationships between diseases and disease targets has crucial value in repurposing. While focused investigations are already possible within the existing Entrez system, navigation across these linked information spaces can be difficult to do on a large scale with current tools. We are actively developing new infrastructure to support such analyses, and pursuing new methods of exploring inter- and intra-database relationships between chemicals, targets, diseases, and patents. Progress and some future directions in these areas will be presented.
2. What is a “Knowledge Space”?
May be a database
But may be a concept not encapsulated in a database
Diagram: knowledge spaces shown are Genes, Diseases, Literature (PubMed), Chemicals (PubChem), Assays (PubChem), Targets (sequences), Patents, and Drugs
3. Connecting the Spaces
Database cross-links
Diagram: cross-links among Chemicals (PubChem), Assays (PubChem), Literature (PubMed), and Targets (sequences), with links labeled Active, Inactive, MeSH, and Depositor
4. Moving Within a Space
Neighbors… some examples
Diagram: neighbor relationships around Chemicals (PubChem) and Assays (PubChem)
Same parent
Same connectivity
Similar by 2D or 3D
Similar sets of screened chemicals
Similar target (BLAST)
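The "similar by 2D" neighbor relationship can be illustrated with a Tanimoto comparison of fingerprint bit sets. This is only a minimal sketch: the fingerprints, the compound IDs, and the 0.9 threshold are assumptions chosen for illustration, and the slides do not describe how PubChem computes its neighbors.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def neighbors(query_id, fingerprints, threshold=0.9):
    """Return the IDs whose fingerprint is at least `threshold` similar to the query's."""
    query_fp = fingerprints[query_id]
    return sorted(
        other_id
        for other_id, fp in fingerprints.items()
        if other_id != query_id and tanimoto(query_fp, fp) >= threshold
    )


# Hypothetical toy fingerprints keyed by made-up compound IDs.
TOY_FINGERPRINTS = {
    1: frozenset(range(1, 11)),          # bits 1..10
    2: frozenset(range(1, 12)),          # bits 1..11, one extra bit
    3: frozenset({1, 2, 20, 21, 22}),    # mostly different bits
}

if __name__ == "__main__":
    print(neighbors(1, TOY_FINGERPRINTS))
```

Moving within a space then amounts to repeating this lookup for each compound of interest.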
5. Drug Repurposing as a Spatial Transformation
One possible route…
Diagram: a route through the Diseases, Drugs, and Targets spaces using Search and Similarity steps, moving from known to hypothesized associations
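One possible reading of this route can be sketched as a chain of set lookups: start from a known disease, collect its drugs, widen the set by similarity, map to targets, and then map to other diseases. Every identifier and link table below is a hypothetical placeholder, and the order of the steps is an assumption about the diagram rather than a documented workflow.

```python
# All names below are hypothetical placeholders, not real drugs, targets, or diseases.
DISEASE_TO_DRUGS = {"disease_A": {"drug_1", "drug_2"}}
SIMILAR_DRUGS = {"drug_1": {"drug_3"}, "drug_2": set()}
DRUG_TO_TARGETS = {
    "drug_1": {"target_X"},
    "drug_2": {"target_Y"},
    "drug_3": {"target_Z"},
}
TARGET_TO_DISEASES = {
    "target_X": {"disease_A"},
    "target_Y": {"disease_A", "disease_B"},
    "target_Z": {"disease_C"},
}


def follow(ids, links):
    """Map a set of IDs through one link table and return the union of the results."""
    result = set()
    for item in ids:
        result |= links.get(item, set())
    return result


def hypothesize_diseases(known_disease):
    """Walk disease -> drugs -> similar drugs -> targets -> diseases."""
    drugs = follow({known_disease}, DISEASE_TO_DRUGS)
    candidates = drugs | follow(drugs, SIMILAR_DRUGS)
    targets = follow(candidates, DRUG_TO_TARGETS)
    diseases = follow(targets, TARGET_TO_DISEASES)
    # Drop the starting disease so only new hypotheses remain.
    return sorted(diseases - {known_disease})


if __name__ == "__main__":
    print(hypothesize_diseases("disease_A"))
```

Each `follow` call stands in for one transformation between spaces; in practice each table would come from a different database or link set.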
6. What is in PubChem
117M Substances (SIDs)
Information from depositors, including links to PubMed, sequences, structures, patents, etc.
47M Compounds (CIDs)
Derived from Substances (including links)
Computed properties
650k Assays (AIDs)
~200M test results on SIDs
Links to target sequences
7. Some PubChem Statistics
All CIDs: 46,814,409
Unique parents by connectivity: 36,806,372
Rule of 5: 34,343,056
Rule of 5 but MW 250-800: 31,483,865
Active in any BioAssay: 824,028
Tested in any BioAssay: 1,872,313
Experimental 3D (mainly PDB): 41,406
Computed 3D (multiple confs + neighbors): 42,252,570
Pharmacological Actions: 11,531
Biosystems: 9,703
Chemical vendors: 28,852,943
NIH Molecular Libraries: 402,076
Patent sources: 14,512,499
Patent links: 5,978,538
… as of 2013/03/20
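The "Rule of 5" rows refer to Lipinski's rule of five. As a rough illustration of what such a filter checks, the sketch below uses the commonly cited thresholds (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 acceptors) and allows no violations. How PubChem applies the filter is not stated in the slides; treating "Rule of 5 but MW 250-800" as the same test with a different weight window is an assumption, and the property values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Props:
    """Computed properties for one compound (values here are hypothetical)."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


def passes_rule_of_5(p: Props, mw_range=(0.0, 500.0)) -> bool:
    """Strict check: every criterion must hold; `mw_range` is an inclusive window."""
    low, high = mw_range
    return (
        low <= p.mol_weight <= high
        and p.logp <= 5
        and p.h_donors <= 5
        and p.h_acceptors <= 10
    )


small = Props(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
large = Props(mol_weight=650.0, logp=3.0, h_donors=2, h_acceptors=8)

if __name__ == "__main__":
    print(passes_rule_of_5(small), passes_rule_of_5(large))
    print(passes_rule_of_5(small, (250.0, 800.0)), passes_rule_of_5(large, (250.0, 800.0)))
```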
8. What is in NCBI Entrez
Many other databases…
PubMed
Protein/Nucleotide sequences
Genes
Biosystems (metabolic pathways)
PDB structures (with VAST neighbors)
Text and numeric search fields
Cross-links
Between databases
Within databases (neighbors)
9. How Entrez Works
Search results = list of identifiers
Boolean operations on lists (query refinement)
Links from one database to another
Diagram: PubChem Search → CID List; Link to PubMed → PMID List; PubChem Search → CID List
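The search, refine, and link pattern above can be sketched with the public E-utilities endpoints. The code below only builds request URLs and combines ID lists locally; it makes no network calls. The search term and the ID values are arbitrary examples rather than results of real searches, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL for a search in one Entrez database; the result is a list of identifiers."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def elink_url(dbfrom, db, ids):
    """URL for following links from IDs in one database to records in another."""
    params = {"dbfrom": dbfrom, "db": db, "id": ",".join(str(i) for i in ids)}
    return f"{EUTILS}/elink.fcgi?" + urlencode(params)


def refine(list_a, list_b):
    """Boolean AND of two ID lists, the simplest form of query refinement."""
    return sorted(set(list_a) & set(list_b))


if __name__ == "__main__":
    # Arbitrary example CIDs standing in for two search results.
    cids = refine([2244, 3672, 1983], [3672, 1983, 5090])
    print(esearch_url("pccompound", "aspirin"))
    print(elink_url("pccompound", "pubmed", cids))
```

Here `pccompound` and `pubmed` are the Entrez database names for PubChem Compound and PubMed.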
10. Limitations of Entrez
Only text or numeric search
Search fields hard to discover
Search fields and defaults vary by database
Chemical structure search, and other specialized algorithms, must be done outside Entrez
The kicker: links are incomplete
Only 500-10,000 ids!
Limit also varies by database
11. Working Around the Limitations
Scripting
E-Utils, PUG SOAP/REST, etc.
Break queries into smaller chunks
Specialized services
PubChem’s ID Exchange
Classification trees (with associated IDs)
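Breaking a query into smaller chunks can be as simple as splitting a long ID list into batches and issuing one request per batch. The sketch below builds PUG REST property URLs for each batch without sending them. The batch size of 100 is an arbitrary assumption, not a documented limit, and the helper names are hypothetical.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chunks(ids, size):
    """Yield successive lists of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def property_urls(cids, prop="MolecularWeight", size=100):
    """Yield one PUG REST URL per batch of CIDs."""
    for batch in chunks(cids, size):
        cid_list = ",".join(str(c) for c in batch)
        yield f"{PUG_REST}/compound/cid/{cid_list}/property/{prop}/JSON"


if __name__ == "__main__":
    for url in property_urls(range(1, 8), size=3):
        print(url)
```

A script would then fetch each URL in turn and merge the per-batch results.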
12. What is not in Entrez
… as a database per se, but which may be imported and linked to PubChem
Drugs (sort of but not really)
Targets (again sort of)
Diseases
Patents
13. Some Public Sources of Information Relevant to Drugs and Repurposing
United States (FDA, NLM, NCBI, …)
ClinicalTrials.gov
NDF(-RT)
RxNorm
HSDB
MeSH
DailyMed
PubMed, PubMed Health
USPTO
Europe
ChEBI / ChEMBL
EPO / WIPO
Canada
DrugBank
Japan
KEGG
… not an exhaustive list
… some are linked to PubChem
… some are works in progress
14. MeSH and ChEBI
Chemical structure classification
Biological role
Pharmacological action
22. Big Classifications… Some Engineering Required
WIPO IPC
• 72,000 tree nodes
• 6,000,000 CIDs
• 124,000,000 node-CID links
Filtering on the fly:
• 22,000 CIDs from PDB
… interactive!
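Filtering a classification on the fly can be pictured as intersecting each node's CID set with a filter set and rolling the matches up the tree. The sketch below does this on a tiny made-up tree; the node names, CIDs, and in-memory dictionaries are hypothetical, and nothing here reflects how the actual service is engineered at the scale quoted above.

```python
# Hypothetical miniature classification: child nodes and directly linked CIDs per node.
CHILDREN = {"root": ["A", "B"], "A": ["A1", "A2"], "B": [], "A1": [], "A2": []}
NODE_CIDS = {"root": set(), "A": set(), "A1": {1, 2, 3}, "A2": {3, 4}, "B": {5, 6}}


def subtree_cids(node, keep):
    """Return {node: CIDs from `keep` found at or below that node} for the subtree."""
    result = {}

    def visit(current):
        found = NODE_CIDS.get(current, set()) & keep
        for child in CHILDREN.get(current, []):
            found = found | visit(child)
        result[current] = found
        return found

    visit(node)
    return result


def filtered_counts(keep, root="root"):
    """Count matching CIDs per node, omitting nodes with no matches."""
    return {n: len(c) for n, c in subtree_cids(root, keep).items() if c}


if __name__ == "__main__":
    # A made-up filter set, playing the role of "CIDs from PDB".
    print(filtered_counts({2, 3, 5, 99}))
```

Using sets means a CID linked to several sibling nodes is counted once at their parent.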
23. More Space to Explore
Diagram: Genes, Literature (PubMed), Assays (PubChem), Chemicals (PubChem), Targets (sequences), Patents, Drugs, Diseases … and beyond
24. Conclusions
PubChem is…
A very generalized system
Based on open data
Part of the larger Entrez collection
We strive to…
Make analysis across multiple knowledge spaces accessible and powerful
Enable hypothesis generation for drug repurposing (as one scenario among many)
Feedback is always welcome!
info@ncbi.nlm.nih.gov