Navigating between patents, papers,
abstracts and databases using public
          sources and tools



       Christopher Southan¹ and Sean Ekins²
         ¹TW2Informatics, Göteborg, Sweden
   ²Collaborative Drug Discovery, North Carolina, USA

                   ACS, April 2013




                                                       [1]
[2]
ACS Abstract

Engaging with chemistry in the biosciences requires navigation between
journals, patents, abstracts, databases, Google results and connecting across
millions of structures specified only in text. The ability to do this in public
sources has been revolutionised by several trends: a) ChEMBL's capture of SAR
from journals c) the deposition of three major automated patent extractions
(SureChem, IBM and SCRIPDB) in PubChem for over 15 million structures, d)
open tools such as chemicalize.org, OPSIN, and OSCAR that enable the
conversion of IUPAC names or images to structures e) the indexing of chemical
terms (e.g. InChIKeys) that turn Google searches into a merged global
repository of 40 to 50 million structures. Details of these trends, including
PubChem intersect statistics, will be presented, along with practical examples
from selected tools. New structure sharing trends will also be considered such
as patent crowdsourcing, Dropbox, blogs, figshare and open lab notebooks.




                                                                                  [3]
Getting chemistry out of text and linking to data:
  some is done but we have to dig for the rest




                                                 [4]
Estimates for chemical text tombs


• Journal chemistry public extraction: ~10 to 20 million entombed?
• Majority of useful patent chemistry already publicly extracted, but ~5
  to 10 million still to go?
• PubMed abstracts and MeSH chemistry: ~0.5 million still entombed?
• Other unique, useful, text-only (i.e. no database cross-references)
  chemistry on the web: ~0.1 to 0.5 million entombed?




                                                                          [5]
What’s out there: publicly disinterred structures

    •   InChIKey in Google ~ 50 million
    •   PubChem = 48 million
    •   PubChem ROF + 250-800 Mw (lead-like) = 31 million
    •   ChemSpider = 28 million
    •   PubChem all docs (papers & patents) = 16 million
    •   PubChem patents = 15 million
    •   SureChemOpen = 13 million
    •   PubChem journal sources (PubMed + ChEMBL) = 1 million




~90% of all structures in databases have their primary origin in text sources



                                                                                [6]
Medicinal chemistry patents (tombs with lids off)

 • 18,777,229 patents, 2,208,422 WOs (i.e. ~9 per family)
 • WO, C07 or A61 = 469,856
 • WO, C07D or A61K = 235,854
 • WO, C07D = 72,737 (assignee vs. year plots below)




                                                              [7]
PubMed at 22 mill:
~ 10% with chemistry (guarded tombs)




      “Free full text” = 575,513 (24%)




                                         [8]
Top-5 Med Chem journals (4% lids off tombs)




             “Free full text” = 2671 (4.3%)
                                              [9]
Growth: (escaping the tombs)
• Patent “big bang” (SureChem & SCRIPDB in 2012)
• Literature “slow burn” (ChEMBL 2009 jump)
• Paradox - patents:papers 15:1
(both sets of CIDs cumulative)
                     [10]
Patents in PubChem:
         post-bang total vs. unique content




PubChem at 47.3 million CIDs, 32% include patents, 20% patent-only
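
As a rough check, the percentages above can be turned into absolute CID counts. This is only arithmetic on the slide's own figures; the variable names and the rounding to one decimal place are assumptions made here for illustration.

```python
# Figures taken from the slide: 47.3 million CIDs, 32% with patent sources,
# 20% found only in patent sources.
total_cids = 47.3e6

with_patents = 0.32 * total_cids        # CIDs that include a patent source
patent_only = 0.20 * total_cids         # CIDs with no non-patent source
also_elsewhere = with_patents - patent_only  # patent CIDs also seen in other sources

for label, value in [
    ("include patents", with_patents),
    ("patent-only", patent_only),
    ("patents plus other sources", also_elsewhere),
]:
    print(f"{label}: {value / 1e6:.1f} million")
```

The first figure agrees with the "PubChem patents = 15 million" count quoted earlier in the deck.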
                                                                     [11]
Citations: connections between tombs
     but still need to disinter structures

[Diagram labels: Papers; Abstracts; Patents; PubMed "relatedness" heuristics]




                                              [12]
Databases <> structures <> documents:
        links, but few reciprocal

 Papers                       Abstracts

                 0.8 mill
                (ChEMBL)




                 12K        0.2 mill (mainly MeSH)

Patents


            15 mill




                                                     [13]
Post-document retrieval: basic questions

1.    What is the name:IUPAC:image:other ratio in the document?
2.    Which tools might be appropriate for first-pass extractions?
3.    How many and what proportion of strucs can be extracted?
4.    Which SAR/in vivo/clinical data is linked to strucs?
5.    Which document sections include the key strucs?
6.    Which database entries have links (back) to this document?
7.    Which strucs have InChIKey matches in Google, & database entries?
8.    Which strucs have synthesis data?
9.    What other documents specify and/or cite this struc?
10.   Which database records for this struc have links to other documents?
11.   What relationship connections can be made using similarity searches?
12.   What intersects and differences are discernible within a document set?



                                                                                [14]
Triaging document or webpage chemistry
• Identify the structure specification types, e.g.
   – Semantic names (all sources)
   – Code names (press releases, papers and abstracts)
   – IUPAC names (papers, patents and abstracts)
   – Images (papers, patents, & Google images)
   – SMILES (open lab books)
   – InChI strings (open lab books)
   – SDF files (open lab books, & GitHub)

Convert these to a structure (e.g. SDF, SMILES, InChI) then:
   – Search InChIKey in Google
   – Search major databases
   – Search SureChemOpen
   – Compare extracted sets for intersects and diffs
   – Extend exact match connectivity with similarity searching
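
The first two search steps above can be sketched in a few lines, assuming the structure has already been converted to an InChIKey by some other tool. The function names are hypothetical, the Google URL is just a quoted-string query, and the PubChem URL follows the PUG REST `compound/inchikey/.../cids/JSON` pattern. The aspirin key is a placeholder chosen for illustration and does not come from the slides.

```python
import re
from urllib.parse import quote

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(text: str) -> bool:
    """Return True if the text has the shape of a standard InChIKey."""
    return bool(INCHIKEY_RE.match(text.strip()))


def search_urls(inchikey: str) -> dict:
    """Build lookup URLs for an InChIKey (hypothetical helper; no network calls)."""
    key = inchikey.strip()
    if not is_inchikey(key):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return {
        # Quoting the key asks Google for an exact-string match.
        "google": "https://www.google.com/search?q=" + quote('"' + key + '"'),
        # PubChem PUG REST: resolve an InChIKey to matching CIDs.
        "pubchem": (
            "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/"
            + key
            + "/cids/JSON"
        ),
    }


# Placeholder example: the InChIKey of aspirin.
urls = search_urls("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
for name, url in urls.items():
    print(name, url)
```

Only the URLs are constructed here; fetching them and parsing the results is left to whichever client is preferred.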
                                                                 [15]
Triage example: antimalarial starting point

The MMV390048 code name is linked to an image in press reports
but is PubChem and PubMed negative




                         [16]
Images: convert and search

                      Real chemists sketch them in a jiffy;

   the rest of us can use OSRA: Optical Structure Recognition Application




(after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)
                                                                            [17]
Making connections:
image > structure > database > documents




                 CID 53311393 > ChEMBL > PubMed
                 SureChem or chemicalize.org > patent


                                                        [18]
Patent SAR from WO2011086531:
Collating activities via SureChemOpen

     CID 53311393 >




                                        [19]
Patent SAR results: top-20 from 39 IC50s




                                           [20]
Results > figshare




http://figshare.com/articles/Patent_SAR_for_MMV390048/657979
                                                               [21]
Structures > MyNCBI




http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1zWhcobieZbIouGfUdsdbHek5/
                                                                         [22]
SAR Table: iOS app from Molecular Materials Informatics

SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF ->
Dropbox -> SAR Table -> edit in data, R-group decompose -> share


                           [23]
InChIKey in Google: instant orthogonal joining




                                                 [24]
Chemicalize.org: 413 strucs from WO2011086532



CID 53311393 ->




                                            [25]
Using OPSIN and chemicalize.org to fix
     recalcitrant IUPACs from WO2011086532




Can quasi-manually extract ~ 10 more “split IUPAC” examples
                                                              [26]
Clustering document extraction sets: CheS-Mapper




  WO2011086531 -> chemicalize.org -> 413 cpds download ->
  CheS-Mapper -> cluster 8 -> export 53 cpds

                                                            [27]
PubChem -> ChEMBL -> PMID -> assay -> strucs
                   • CHEMBL2041980 (structure)
                   • PMID 22390538 (paper)
                   • CHEMBL2045642 (assay for 32 strucs
                     from paper)
                   • The 32 CIDs all have patent matches




                                                       [28]
Venny: intersects, diffs, de-dupes and merges


                                 1) WO2011086531 matches in PubChem

                                 2) CheS-Mapper cluster 8 from WO2011086532

                                 3) ChEMBL assayed cpds from PMID 22390538

                                 (handles any regular strings e.g. db IDs,
                                 SMILES, InChI or InChIKey)
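
The same intersect, diff, de-dupe and merge operations can be sketched with plain Python sets. This is an illustration of the idea, not how Venny itself is implemented; the function name and the `ID_*` strings are hypothetical placeholders, not identifiers from the three lists above.

```python
def compare_sets(named_lists):
    """Compare lists of identifier strings (db IDs, SMILES, InChIKeys, ...).

    Returns the entries common to all lists, the de-duplicated merge,
    and the entries unique to each list.
    """
    if not named_lists:
        raise ValueError("need at least one list")

    # De-dupe each list and ignore blank lines or stray whitespace.
    sets = {
        name: {item.strip() for item in items if item.strip()}
        for name, items in named_lists.items()
    }
    all_sets = list(sets.values())

    common = set.intersection(*all_sets)  # present in every list
    merged = set().union(*all_sets)       # de-duplicated union

    unique = {}
    for name, members in sets.items():
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = members - others   # present in this list only

    return {"common": common, "merged": merged, "unique": unique}


# Placeholder identifiers for illustration only.
example = {
    "patent_matches": ["ID_A", "ID_B", "ID_C", "ID_B"],
    "cluster_8": ["ID_B", "ID_C", "ID_D"],
    "chembl_assayed": ["ID_C", "ID_E", " ID_B "],
}
result = compare_sets(example)
print(sorted(result["common"]))
print(sorted(result["merged"]))
print({name: sorted(ids) for name, ids in result["unique"].items()})
```

Exact string comparison only works if all lists use the same identifier type and normalisation, which is one reason InChIKeys are convenient for this step.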


                                                        [29]
The open toolbox facilitates extraction and
  collation of 10 to 30 million structures
             entombed in text




                                              [30]
Conclusions

• The ability to extract chemical structures from text and web sources
  has been transformed by an expansion of the public toolbox
• The PubChem big bang increases the probability that extracted structures
  have exact or similarity matches in databases
• Paradoxically, the patent corpus is now completely open while access
  to journal text is still restricted
• However, ChEMBL has extracted ~0.8 mill. SAR-linked and target-mapped
  structures from ~50K papers
• The submission of ~15 mill. patent structures to PubChem ensures at
  least representation from the majority of medicinal chemistry patents
  (many of which spawned the subsequent ChEMBL papers)
• Those who want to share their structures globally (e.g. OSDD) have an
  expanding set of options for surfacing their results.



                                                                          [31]
You can find me @...CDD Booth 205

PAPER ID: 13433
PAPER TITLE: “Dispensing processes profoundly impact biological assays and computational and statistical analyses”
April 8th 8.35am Room 349

PAPER ID: 14750
PAPER TITLE: “Enhancing High Throughput Screening For Mycobacterium tuberculosis Drug Discovery Using Bayesian Models”
April 9th 1.30pm Room 353

PAPER ID: 21524
PAPER TITLE: “Navigating between patents, papers, abstracts and databases using public sources and tools”
April 9th 3.50pm Room 350

PAPER ID: 13358
PAPER TITLE: “TB Mobile: Appifying Data on Anti-tuberculosis Molecule Targets”
April 10th 8.30am Room 357

PAPER ID: 13382
PAPER TITLE: “Challenges and recommendations for obtaining chemical structures of industry-provided repurposing candidates”
April 10th 10.20am Room 350

PAPER ID: 13438
PAPER TITLE: “Dual-event machine learning models to accelerate drug discovery”
April 10th 3.05pm Room 350
                                                                                                       [32]

More Related Content

Viewers also liked

Advanced querying
Advanced queryingAdvanced querying
Advanced querying
strmpnk
 

Viewers also liked (20)

From geek to event organiser
From geek to event organiserFrom geek to event organiser
From geek to event organiser
 
Drupal for Large scale project
Drupal for Large scale projectDrupal for Large scale project
Drupal for Large scale project
 
Devnest 111115
Devnest 111115Devnest 111115
Devnest 111115
 
Functional Reactive Programming at Booster 2014
Functional Reactive Programming at Booster 2014Functional Reactive Programming at Booster 2014
Functional Reactive Programming at Booster 2014
 
Datatium - radiation free responsive experiences
Datatium - radiation free responsive experiencesDatatium - radiation free responsive experiences
Datatium - radiation free responsive experiences
 
Whither Twitter?
Whither Twitter?Whither Twitter?
Whither Twitter?
 
2011 TDI Conference Social Media Guide
2011 TDI Conference Social Media Guide2011 TDI Conference Social Media Guide
2011 TDI Conference Social Media Guide
 
What may I do with your data? What do I have to do with your data? Policie...
What may I do with your data? What do I have to do with your data? Policie...What may I do with your data? What do I have to do with your data? Policie...
What may I do with your data? What do I have to do with your data? Policie...
 
Présentation de LemonLDAP::NG aux Journées Perl 2016
Présentation de LemonLDAP::NG aux Journées Perl 2016Présentation de LemonLDAP::NG aux Journées Perl 2016
Présentation de LemonLDAP::NG aux Journées Perl 2016
 
Introduction to Perl Best Practices
Introduction to Perl Best PracticesIntroduction to Perl Best Practices
Introduction to Perl Best Practices
 
Enrique Allen, D Fund - Warm Gun Conference
Enrique Allen, D Fund - Warm Gun ConferenceEnrique Allen, D Fund - Warm Gun Conference
Enrique Allen, D Fund - Warm Gun Conference
 
SXSW 2013: How Twitter is Changing How We Watch TV
SXSW 2013: How Twitter is Changing How We Watch TVSXSW 2013: How Twitter is Changing How We Watch TV
SXSW 2013: How Twitter is Changing How We Watch TV
 
Simplicity: UXLx version
Simplicity: UXLx versionSimplicity: UXLx version
Simplicity: UXLx version
 
Make your web apps "Go, Go" like Power Rangers
Make your web apps "Go, Go" like Power RangersMake your web apps "Go, Go" like Power Rangers
Make your web apps "Go, Go" like Power Rangers
 
Advanced querying
Advanced queryingAdvanced querying
Advanced querying
 
Introducing Xapian
Introducing XapianIntroducing Xapian
Introducing Xapian
 
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor AppsLibrato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
Librato's Joseph Ruscio at Heroku's 2013: Instrumenting 12-Factor Apps
 
Morgan e xt_062811
Morgan e xt_062811Morgan e xt_062811
Morgan e xt_062811
 
Responsive Web Design - but for real!
Responsive Web Design - but for real!Responsive Web Design - but for real!
Responsive Web Design - but for real!
 
Combining Context with Signals in the Internet of Things
Combining Context with Signals in the Internet of ThingsCombining Context with Signals in the Internet of Things
Combining Context with Signals in the Internet of Things
 

Similar to Navigatingbetween patents, papers, abstracts and databases using public sources and tools

Closing the gap between chemistry and biology: Joining between text tombs and...
Closing the gap between chemistry and biology: Joining between text tombs and...Closing the gap between chemistry and biology: Joining between text tombs and...
Closing the gap between chemistry and biology: Joining between text tombs and...
Chris Southan
 
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
Dr. Haxel Consult
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
Dr. Haxel Consult
 
The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveats
Dr. Haxel Consult
 
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
ChemAxon
 

Similar to Navigatingbetween patents, papers, abstracts and databases using public sources and tools (20)

Closing the gap between chemistry and biology: Joining between text tombs and...
Closing the gap between chemistry and biology: Joining between text tombs and...Closing the gap between chemistry and biology: Joining between text tombs and...
Closing the gap between chemistry and biology: Joining between text tombs and...
 
20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse
 
Patents in PubChem
Patents in PubChemPatents in PubChem
Patents in PubChem
 
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and CaveatsThe Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
 
A Global Commons for Scientific Data: Molecules and Wikidata
A Global Commons for Scientific Data: Molecules and WikidataA Global Commons for Scientific Data: Molecules and Wikidata
A Global Commons for Scientific Data: Molecules and Wikidata
 
Pros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChemPros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChem
 
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
 
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open r...
 
The open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveatsThe open patent chemistry “big bang”: Implications, opportunities and caveats
The open patent chemistry “big bang”: Implications, opportunities and caveats
 
Connectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivityConnectivity > documents > structures > bioactivity
Connectivity > documents > structures > bioactivity
 
Mining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity DataMining Drug Targets, Structures and Activity Data
Mining Drug Targets, Structures and Activity Data
 
Digging out Structures for Repurposing: Non-competitive Intelligence ...
Digging out Structures for Repurposing: Non-competitive Intelligence        ...Digging out Structures for Repurposing: Non-competitive Intelligence        ...
Digging out Structures for Repurposing: Non-competitive Intelligence ...
 
Museum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themMuseum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on them
 
Patent chemisty big bang: utilities for SMEs
Patent chemisty big bang: utilities for SMEsPatent chemisty big bang: utilities for SMEs
Patent chemisty big bang: utilities for SMEs
 
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOp...
 
Connecting Chemists To The Internet Training at Burlington House 2010
Connecting Chemists To The Internet Training at Burlington House 2010Connecting Chemists To The Internet Training at Burlington House 2010
Connecting Chemists To The Internet Training at Burlington House 2010
 
2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...2010 CASCON - Towards a integrated network of data and services for the life ...
2010 CASCON - Towards a integrated network of data and services for the life ...
 
The PubChemQC Project
The PubChemQC ProjectThe PubChemQC Project
The PubChemQC Project
 
Overview of cheminformatics
Overview of cheminformaticsOverview of cheminformatics
Overview of cheminformatics
 
Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...
Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...
Can a Free Access Structure-Centric Community for Chemists Benefit Drug Disco...
 

More from Sean Ekins

More from Sean Ekins (20)

How to Win a small business grant.pptx
How to Win a small business grant.pptxHow to Win a small business grant.pptx
How to Win a small business grant.pptx
 
Evaluating Multiple Machine Learning Models for Biodegradation and Aquatic To...
Evaluating Multiple Machine Learning Models for Biodegradation and Aquatic To...Evaluating Multiple Machine Learning Models for Biodegradation and Aquatic To...
Evaluating Multiple Machine Learning Models for Biodegradation and Aquatic To...
 
A presentation at the Global Genes rare drug development symposium on governm...
A presentation at the Global Genes rare drug development symposium on governm...A presentation at the Global Genes rare drug development symposium on governm...
A presentation at the Global Genes rare drug development symposium on governm...
 
Leveraging Science Communication and Social Media to Build Your Brand and Ele...
Leveraging Science Communication and Social Media to Build Your Brand and Ele...Leveraging Science Communication and Social Media to Build Your Brand and Ele...
Leveraging Science Communication and Social Media to Build Your Brand and Ele...
 
Bayesian Models for Chagas Disease
Bayesian Models for Chagas DiseaseBayesian Models for Chagas Disease
Bayesian Models for Chagas Disease
 
Assay Central: A New Approach to Compiling Big Data and Preparing Machine Lea...
Assay Central: A New Approach to Compiling Big Data and Preparing Machine Lea...Assay Central: A New Approach to Compiling Big Data and Preparing Machine Lea...
Assay Central: A New Approach to Compiling Big Data and Preparing Machine Lea...
 
Drug Discovery Today March 2017 special issue
Drug Discovery Today March 2017 special issueDrug Discovery Today March 2017 special issue
Drug Discovery Today March 2017 special issue
 
Using In Silico Tools in Repurposing Drugs for Neglected and Orphan Diseases
Using In Silico Tools in Repurposing Drugs for Neglected and Orphan DiseasesUsing In Silico Tools in Repurposing Drugs for Neglected and Orphan Diseases
Using In Silico Tools in Repurposing Drugs for Neglected and Orphan Diseases
 
Five Ways to Use Social Media to Raise Awareness for Your Paper or Research
Five Ways to Use Social Media to Raise Awareness for Your Paper or ResearchFive Ways to Use Social Media to Raise Awareness for Your Paper or Research
Five Ways to Use Social Media to Raise Awareness for Your Paper or Research
 
Open zika presentation
Open zika presentation Open zika presentation
Open zika presentation
 
academic / small company collaborations for rare and neglected diseasesv2
 academic / small company collaborations for rare and neglected diseasesv2 academic / small company collaborations for rare and neglected diseasesv2
academic / small company collaborations for rare and neglected diseasesv2
 
CDD models case study #3
CDD models case study #3 CDD models case study #3
CDD models case study #3
 
CDD models case study #2
CDD models case study #2 CDD models case study #2
CDD models case study #2
 
CDD Models case study #1
CDD Models case study #1 CDD Models case study #1
CDD Models case study #1
 
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
 
CDD: Vault, CDD: Vision and CDD: Models software for biologists and chemists ...
CDD: Vault, CDD: Vision and CDD: Models software for biologists and chemists ...CDD: Vault, CDD: Vision and CDD: Models software for biologists and chemists ...
CDD: Vault, CDD: Vision and CDD: Models software for biologists and chemists ...
 
The future of computational chemistry b ig
The future of computational chemistry b igThe future of computational chemistry b ig
The future of computational chemistry b ig
 
#ZikaOpen: Homology Models -
#ZikaOpen: Homology Models - #ZikaOpen: Homology Models -
#ZikaOpen: Homology Models -
 
Slas talk 2016
Slas talk 2016Slas talk 2016
Slas talk 2016
 
Pros and cons of social networking for scientists
Pros and cons of social networking for scientistsPros and cons of social networking for scientists
Pros and cons of social networking for scientists
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Navigatingbetween patents, papers, abstracts and databases using public sources and tools

  • 1. Navigating between patents, papers, abstracts and databases using public sources and tools Christopher Southan1 and Sean Ekins2 TW2Informatics, Göteborg, Sweden, Collaborative Drug Discovery, North Carolina, USA ACS, April 2013 [1]
  • 2. [2]
  • 3. ACS Abstract Engaging with chemistry in the biosciences requires navigation between journals, patents, abstracts, databases, Google results and connecting across millions of structures specified only in text. The ability to do this in public sources has been revolutionised by several trends a) ChEMBL's capture of SAR from journals c) the deposition of three major automated patent extractions (SureChem, IBM and SCRIPDB) in PubChem for over 15 million structures, d) open tools such as chemicalize.org, OPSIN, and OSCAR that enable the conversion of IUPAC names or images to structures e) the indexing of chemical terms (e.g. InChIKeys) that turn Google searches into a merged global repository of 40 to 50 million structures. Details of these trends, including PubChem intersect statistics, will be presented, along with practical examples from selected tools. New structure sharing trends will also be considered such as patent crowdsourcing, dropbox, blogs, figshare and open lab notebooks. [3]
  • 4. Getting chemistry out of text and linking to data: some is done but we have to dig for the rest [4]
  • 5. Estimates for chemical text tombs • Journal chemistry public extraction, ~10 to 20 million entombed? • Majority of useful patent chemistry already publicly extracted, but ~5 to 10 million still to go? • PubMed abstracts and MeSH chemistry ~ 0.5 million still entombed? • Other unique, useful, text-only (i.e. no database cross-references) chemistry on the web ~ 0.1 to 0.5 million entombed? [5]
  • 6. What’s out there: publicly disinterred structures • InChIKey in Google ~ 50 million • PubChem = 48 million • PubChem ROF + 250-800 Mw (lead-like) = 31 million • ChemSpider = 28 million • PubChem all docs (papers & patents) = 16 million • PubChem patents = 15 million • SureChemOpen = 13 million • PubChem journal sources (PubMed + ChEMBL) = 1 million ~90% of all structures in databases have their primary origin in text sources [6]
  • 7. Medicinal chemistry patents (tombs with lids off) • 18,777,229 patents, 2,208,422 WO’s (i.e. ~ 9 per family) • WO, C07 or A61= 469,856 • WO , C07D or A61K = 235,854 • WO, C07D = 72,737 (assignee vs. year plots below) [7]
  • 8. PubMed at 22 mill: ~ 10% with chemistry (guarded tombs) “Free full text” = 575,513 (24%) [8]
  • 9. Top-5 Med Chem journals (4% lids off tombs) “Free full text” = 2671 (4.3%) [9]
  • 10. Growth: (escaping the tombs) • Patent “big bang” (SureChem & SCRIPDB in 2012) • Literature “slow burn” (ChEMBL 2009 jump) • Paradox - patents:papers 15:1 (both sets of CIDs cumulative) [10]
  • 11. Patents in PubChem: post-bang total vs. unique content PubChem at 47.3 million CIDs, 32% include patents, 20% patent-only [11]
  • 12. Citations: connections between tombs but still need to disinter structures Papers Abstracts PubMed Patents "relatedness" heuristics [12]
  • 13. Databases <> structures < > documents: links, but few reciprocal Papers Abstracts 0.8 mill (ChEMBL) 12K 0.2 mill (mainly MeSH) Patents 15 mill [13]
  • 14. Post-document retrieval: basic questions 1. What is the name:IUPAC:image:other ratio in the document? 2. Which tools might be appropriate for first-pass extractions? 3. How many and what proportion of strucs can be extracted? 4. Which SAR/in vivo/clinical data is linked to strucs? 5. Which document sections include the key strucs? 6. Which database entries have links (back) to this document? 7. Which strucs have InChIKey matches in Google, & database entries? 8. Which strucs have synthesis data? 9. What other documents specify and/or cite this struc? 10. Which database records for this struc have links to other documents? 11. What relationship connections can be made using similarity searches? 12. What intersects and differences are discernible within a document set? [14]
  • 15. Triaging document or webpage chemistry • Identify the structure specification types, e.g. – Semantic names (all sources) – Code names (press releases, papers and abstracts) – IUPAC names (papers, patents and abstracts) – Images (papers, patents, & Google images) – SMILES (open lab books) – InChI strings (open lab books) – SDF files (open lab books, & github) Convert these to a structure (e.g. SDF, SMILES, InChI) then: – Search InChIKey in Google – Search major databases – Search SureChemOpen – Compare extracted sets for intersects and diffs – Extend exact match connectivity with similarity searching [15]
  • 16. Triage example: antimalarial starting point The MMV390048 code name is linked to an image in press reports but is PubChem and PubMed -ve [16]
  • 17. Images: convert and search Real chemists sketch them in a jiffy; the rest of us can use OSRA: Optical Structure Recognition Application (after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3) [17]
  • 18. Making connections: image > structure > database > documents CID 53311393 > ChEMBL > PubMed SureChem or chemicalize.org > patent [18]
  • 19. Patent SAR from WO2011086531: Collating activities via SureChemOpen CID 53311393 > [19]
  • 20. Patent SAR results: top-20 from 39 IC50s [20]
  • 23. SAR Table: iOS app from Molecular Materials Informatics SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF -> Dropbox -> SAR Table -> edit in data, R-group decompose -> share [23]
  • 24. InChIKey in Google: instant orthogonal joining [24]
  • 25. Chemicalize.org: 413 strucs from WO2011086532 CID 53311393 -> [25]
  • 26. Using OPSIN and chemicalize.org to fix recalcitrant IUPACs from WO2011086532 Can quasi-manually extract ~ 10 more “split IUPAC” examples [26]
  • 27. Clustering document extraction sets: CheS-Mapper WO2011086531 -> chemicalize.org -> 413 cpds download -> CheS-Mapper -> cluster 8 -> export 53 cpds [27]
  • 28. PubChem -> ChEMBL -> PMID -> assay -> strucs • CHEMBL2041980 (structure) • PMID 22390538 (paper) • CHEMBL2045642 (assay for 32 strucs from paper) • The 32 CIDs all have patent matches • [28]
  • 29. Venny: intersects, diffs, de-dupes and merges 1) WO2011086531 matches in PubChem 2) CheS-Mapper cluster 8 from WO2011086532 3) ChEMBL assayed cpds from PMID 22390538 (handles any regular strings e.g. db IDs, SMILES, InChI or InChIKey) [29]
  • 30. The open toolbox facilitates extraction and collation of 10 to 30 million structures entombed in text [30]
  • 31. Conclusions • The ability to extract chemical structures from text and web sources has been transformed by an expansion of the public toolbox • The PubChem big-bang increases probability of extraction having database exact or similarity matches • Paradoxically, the patent corpus is now completely open while access to journal text is still restricted • However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and target mapped structures from ~ 50K papers • The submission of ~15 mill. patent structures to PubChem ensures at least representation from the majority of medicinal chemistry patents (many of which spawned the subsequent ChEMBL papers) • Those who want to share their structures globally (e.g. OSDD) have an expanding set of options for surfacing their results. [31]
  • 32. You can find me @...CDD Booth 205 PAPER ID: 13433 PAPER TITLE: “Dispensing processes profoundly impact biological assays and computational and statistical analyses” April 8th 8.35am Room 349 PAPER ID: 14750 PAPER TITLE: “Enhancing High Throughput Screening For Mycobacterium tuberculosis Drug Discovery Using Bayesian Models” April 9th 1.30pm Room 353 PAPER ID: 21524 PAPER TITLE: “Navigating between patents, papers, abstracts and databases using public sources and tools” April 9th 3.50pm Room 350 PAPER ID: 13358 PAPER TITLE: “TB Mobile: Appifying Data on Anti-tuberculosis Molecule Targets” April 10th 8.30am Room 357 PAPER ID: 13382 PAPER TITLE: “Challenges and recommendations for obtaining chemical structures of industry-provided repurposing candidates” April 10th 10.20am Room 350 PAPER ID: 13438 PAPER TITLE: “Dual-event machine learning models to accelerate drug discovery” April 10th 3.05 pm Room 350 [32]
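
The headline counts in the transcript (slides 7, 8, 11 and 31) can be cross-checked with simple arithmetic. The following is a minimal Python sketch that uses only numbers quoted on the slides; the variable names are illustrative, and the derived values are back-of-envelope estimates rather than figures stated by the authors.

```python
# Back-of-envelope checks on figures quoted in the slides.
# All inputs are copied from the transcript; variable names are illustrative.

# Slide 7: total patent documents vs. WO (PCT) publications.
total_patents = 18_777_229
wo_patents = 2_208_422
docs_per_wo = total_patents / wo_patents  # the slide rounds this to "~ 9 per family"

# Slide 11: PubChem CIDs and the share that carry patent sources.
pubchem_cids = 47.3e6
cids_with_patents = pubchem_cids * 0.32  # "32% include patents"
cids_patent_only = pubchem_cids * 0.20   # "20% patent-only"

# Slide 8: the free-full-text count and its percentage imply the size of
# the chemistry-tagged PubMed subset (an inferred value, not a quoted one).
chem_pubmed_subset = 575_513 / 0.24

# Slide 31: ChEMBL structures per paper (~0.8 million from ~50K papers).
chembl_per_paper = 0.8e6 / 50_000

print(f"documents per WO publication: {docs_per_wo:.1f}")
print(f"CIDs with patent sources:     {cids_with_patents / 1e6:.1f} million")
print(f"patent-only CIDs:             {cids_patent_only / 1e6:.2f} million")
print(f"implied chemistry PubMed set: {chem_pubmed_subset / 1e6:.1f} million")
print(f"ChEMBL structures per paper:  {chembl_per_paper:.0f}")
```

The patent ratio comes out nearer 8.5 than 9, and 32% of 47.3 million is about 15.1 million, in line with the "PubChem patents = 15 million" entry on slide 6. The implied chemistry-tagged PubMed subset of roughly 2.4 million is also consistent with "~ 10%" of 22 million records.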

Editor's Notes

  1. 70 million substances in CAS suggest a 20-30 million shortfall (i.e. SciFinder only), but they include virtuals and libraries. SureChem will continue patent extraction, but expect an asymptote of true novels only soon. PubMed capture is largely dependent on MeSH, but a lot of IUPAC chemistry is only annually updated, and some is not captured. SureChem, IBM and chemicalize all indicate that, including MeSH terms, at least 0.5 million structures could be extracted from PubMed. No idea how much web-unique chemistry (not in documents or databases) is out there, but open lab books will increase this.
  2. InChIKeys: an estimate of PubChem + ChemSpider in Google, but PubChem currently has a backlog for Key scraping. The ROF + 250-800 is a very approximate circumscription of the property space that has some possibility of bioactivity. Probably a proportion of vendor structures may never have been committed to text. There are some virtuals “out there”, including some patent extractions, but they are difficult to estimate.
  3. Note the WO/PCT queries are non-redundant in the patent family sense. The medicinal chemistry corpus is actually quite small. Note big pharma patent decline post-2008. Average exemplified cpds with activity data per patent (family) is unknown, but GVK's curation average is ~ 50.
  4. Using the top level MeSH term as a filter for “PubMeds with some chemistry”. Free full text is ~ ¼, but there are a lot of biological journals in this set.
  5. Select the core journals used for med chem extraction by GVKBIO and ChEMBL. Not a large corpus. Both extract ~ 15 cpds per paper. Note the proportion of “free full text” is low.
  6. Note that cumulative plots include an element of back-mapping, i.e. the 2005 matches are to the 2013 total, not just the 2005 documents.
  7. PubChem hit 15 million patent structures in March 2013. Largest unique content is SureChemOpen. Thomson uniqueness is low because a) they include at least 30% journal extractions and b) the Derwent WPI content (was) also in DiscoveryGate. IBM are only pre-2000 patents and the extracted content overlaps with other sources.
  8. Citations are a core tradition but they do not provide direct structure <-> structure links. Patents cite papers but papers rarely cite patents (with the exception of patent reviews).
  9. Only Nature Chemical Biology and Nature Chemistry have direct links from the journal document to PubChem. Given today's technology, the major patent offices could put links in the PDFs but are unlikely to do so.
  10. The problem “how do I find the chemistry out there relevant to my interests” is a general search retrieval, recall and specificity challenge that cannot be addressed here. Beyond PubMed and Google it’s getting better (e.g. indexing of full text patents), but there are still issues (e.g. text mining of chemical journals is still very restricted). Once you have found the documents or text, these are the typical set of questions you might want to address, especially in regard to choosing which tools are best for the job.
  11. Need to assess what representational types are being used in the document, e.g. some patents are image-only (but SureChem is pulling most of these out). Then select tools and sources for the job. Decide how to store your structures locally. The default batch search is an upload to PubChem. The default individual search is the InChIKey against Google.
  12. Self-explanatory. Note my blog post was indexed.
  13. The simplest of starting points: at least the press release had a structure diagram. OSRA provides good starting points to edit and get SMILES. The structure does not have to be exactly right because a database similarity match is OK to see what it should have been.
  14. SMILES from the image hits the CID in PubChem. This links to patents via SureChem and chemicalize.org. ChEMBL provides a link to the paper. Note none of these sources have MMV390048 as a synonym, so all the connections are via structure.
  15. We can start off with patent links. Note in this case numbered image capture, as opposed to the IUPAC listing, was important to manually collate the structure against the correct IC50.
  16. From manual cross-checking between the individual example structures and the IC50 table the Excel sheet can be populated
  17. Useful way to share results that is citable. Indexed in Google but no live links in Excel sheet (yet).
  18. Can upload CID lists and download as a saved and public collection
  19. This is the Pistoia / Alex Clark SAR Table app. Dropped the CIDs out of PubChem into Dropbox and picked them up on the iPad. Nice, but it would be good to automate the decomposition.
  20. InChIKey search picks up instantly. This was just a choice of one of the actives. So this connects PubChem and figshare.
  21. The CID links straight through to chemicalize and will just re-extract the whole patent in a few seconds. The 413 gave 358 hits in PubChem.
  22. IUPAC names have a lot of usage variants and OCR mistakes. Typically gaps, line breaks, 1 instead of 1 and missing brackets. OPSIN is good for indicating where the break is. This can then be fixed for a series in chemicalize.org.
  23. Total extractions from patents can include a lot of low Mw common reagent chemistry. CheS-Mapper display makes it easy to pick out clusters of lead-like compounds. Clusters can then be downloaded. Flexibility is high because document sets can be split or merged at the input stage.
  24. ChEMBL extracts structure and data. Can't actually select a set of cpds via the PubMed ID but can via the assay ID, which is usually unique to that paper. In this case we got 32 structures, all of which came from that patent.
  25. Very useful utility for any kind of set operations, e.g. sets of extractions. Total flexibility, e.g. intersecting patents and papers with extractions from abstract sets. Sets can be de-duplicated and merged from multiple sets (e.g. 10 patent extractions in one box). Can combine with selected downloaded database records.
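
The intersect, diff, de-dupe and merge operations described for Venny can also be reproduced locally on any list of identifier strings. Below is a minimal Python sketch using built-in sets; it is not how Venny itself is implemented, the function names are hypothetical, and the identifiers in the example are invented placeholders rather than real CIDs or extraction results from the slides.

```python
def dedupe(ids):
    """Strip whitespace, drop blanks and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for raw in ids:
        key = raw.strip()
        if key and key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


def compare_sets(named_lists):
    """Return (common to all sets, unique to each set, merged union)."""
    sets = {name: set(dedupe(ids)) for name, ids in named_lists.items()}
    common = set.intersection(*sets.values())
    merged = set.union(*sets.values())
    unique = {}
    for name, members in sets.items():
        # Everything that appears in any of the other sets.
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = members - others
    return common, unique, merged


# Placeholder identifiers only; any regular strings work (db IDs, SMILES, InChIKeys).
example = {
    "patent_pubchem_matches": ["ID_A", "ID_B", "ID_C", "ID_B"],
    "cluster_export": ["ID_B", "ID_C", "ID_D"],
    "chembl_assay_cpds": ["ID_C", "ID_E", " ID_B "],
}

common, unique, merged = compare_sets(example)
print("in all three:", sorted(common))
print("unique per set:", {name: sorted(ids) for name, ids in unique.items()})
print("merged total:", len(merged))
```

Exact string matching is the design choice here, so identifiers should be normalised to one form (for example all InChIKeys) before comparison; two different SMILES strings for the same structure would otherwise count as distinct.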