EUGM 2013 - Christopher Southan (TW2Informatics): Chemicalize.org, SureChemOpen, PubChem and the InChIKey: A heavenly conjunction with transformative utility
The ChemAxon Name to Structure functionality is not only a component of the SureChem patent extraction pipeline but also powers chemicalize.org. Both operations are now submitting sources to PubChem. The former has deposited structures that bring the patent-extracted total in PubChem to 14.5 million CIDs. The deposition from chemicalize.org is smaller (~0.3 million) but has been actively selected by users, and 20% of it is unique. The final conjunction is that all three sources generate the InChIKey, which turns Google into a de facto merge of PubChem and ChemSpider of ~50 million structures. Chemicalize.org users can convert new patents, other external or internal documents and web-based text. Individual results can be Googled or searched against SureChemOpen, and bulk extractions can be triaged against PubChem. It thus becomes possible to connect chemistry between patents, papers, abstracts and database records via exact match or similarity searching. When SureChem and chemicalize.org update their submissions, relationships with the other 47 million structures from ~200 PubChem sources (including ChEMBL and vendor databases) are re-computed and new CID links made. The synergy between SureChem and chemicalize.org is powerful because matches between them (~0.15 million), found via SureChemOpen, give occurrence statistics and the location of the structure within patents. The applications of chemicalize.org are extended by web tools such as Venny for determining intersects from multiple extractions and CheS-Mapper for cluster visualization. These utility expansions will be illustrated by documents specifying BACE1 inhibitors for Alzheimer’s disease.
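The intersect step described above (Venny applied to multiple extractions) comes down to set arithmetic over InChIKeys. The following is a minimal Python sketch of that idea, not the Venny implementation. It assumes each extraction has already been exported as a list of InChIKeys; the function name `venn_regions`, the document labels and the sample keys (aspirin, caffeine, ethanol, water) are illustrative placeholders rather than data from the talk.

```python
from itertools import combinations


def venn_regions(extractions):
    """Return the non-empty regions of a Venn diagram.

    `extractions` maps a source name to an iterable of InChIKeys.
    The result maps each tuple of source names to the keys found in
    exactly those sources and in no others.
    """
    names = sorted(extractions)
    sets = {name: set(keys) for name, keys in extractions.items()}
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Keys shared by every source in this combination ...
            inside = set.intersection(*(sets[n] for n in combo))
            # ... minus anything that also occurs in another source.
            others = [sets[n] for n in names if n not in combo]
            exclusive = inside - set().union(*others)
            if exclusive:
                regions[combo] = exclusive
    return regions


# Hypothetical input: three extractions sharing some structures.
ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
CAFFEINE = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"
ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
WATER = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"

extractions = {
    "doc1": [ASPIRIN, CAFFEINE, ETHANOL],
    "doc2": [CAFFEINE, ETHANOL, WATER],
    "doc3": [ETHANOL],
}

regions = venn_regions(extractions)
for combo, keys in regions.items():
    print(" & ".join(combo), len(keys), sorted(keys))
```

Working on InChIKeys rather than names means the same structure extracted under different synonyms still counts as one entry, which is the property the exact-match linking above relies on.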
1. [1]
Chemicalize.org, SureChemOpen, PubChem and
the InChIKey: A heavenly conjunction with
transformative utility
Christopher Southan, TW2Informatics, Göteborg, Sweden,
ChemAxon UGM, Budapest, May 2013
Image credit: http://www.eso.org/public/images/yb_vlt_moon_cnn_cc/
4. [4]
Auspicious Conjunctions 2012-13
• PubChem: global chemistry to slice ‘n dice
• SureChemOpen: majority of patent chemistry opened up
• Chemicalize.org: chemistry extractable from any text tombs
• Chemical images: patents extracted in SureChemOpen, OSRA handles papers
• InChIKey indexing in Google
• ChemSpider: crowdsourcing chemistry quality
• Expanding toolbox e.g. OPSIN, Venny, CheS-Mapper
• SciBite alerts
• Expanding preview and surfacing options e.g. ChEMBL-NTD, GitHub, OSDD, Open Lab Books, figshare etc.
• Rise of mobile chemistry
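The "InChIKey indexing in Google" point can be made concrete with a short Python sketch. It relies on the published layout of the InChIKey (27 characters: a 14-letter connectivity block, a 10-letter block ending in a standard/non-standard flag and a version letter, and a 1-letter protonation flag). The helper names `is_inchikey` and `google_query_url` are hypothetical, the URL is the ordinary public Google search address, and the aspirin key is used only as a familiar example.

```python
import re
from urllib.parse import quote_plus

# Layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(text):
    """True if `text` has the shape of an InChIKey (format check only)."""
    return bool(INCHIKEY_RE.match(text))


def is_standard(key):
    """True if the flag character marks a standard InChIKey ('S')."""
    return is_inchikey(key) and key[23] == "S"


def google_query_url(key, skeleton_only=False):
    """Build a Google search URL for an exact-phrase InChIKey query.

    With skeleton_only=True only the first 14-letter block is searched,
    which also matches keys that share the connectivity layer but differ
    in stereochemistry or isotopes.
    """
    if not is_inchikey(key):
        raise ValueError(f"not an InChIKey: {key!r}")
    term = key.split("-")[0] if skeleton_only else key
    return "https://www.google.com/search?q=" + quote_plus(f'"{term}"')


aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(google_query_url(aspirin))
print(google_query_url(aspirin, skeleton_only=True))
```

The format check does not verify that a key corresponds to a real structure; it only filters out strings that cannot be InChIKeys before they are sent to a search engine.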
16. [16]
Conclusions
• Transformative opening up of chemistry > biology via structure > document connectivity
• Open mining of patent metadata and data
• Expanding toolbox
• Inexorable expansion of open-access publishing
But:
• Journal chemistry extraction > database records still slow
• Text mining of journals still restricted
• Author annotation and direct db submission rare
• Pharmaceutical research publications are still blinding structures (see
PMID: 23159359)