Using publicly available resources to build a comprehensive knowledgebase of chemical information


A variety of public resources on the Internet contain information about various aspects of the chemical, biological and pharmaceutical domains. The quality and maturity of these data resources, and the hosting organizations and team sizes behind them, vary widely. As a consequence, their content cannot always be trusted, and the effort of extracting information and preparing it for reuse is repeated again and again at various levels. This problem is especially serious in applications for QSAR, QSPR and QNAR modeling. The authors of this poster believe, based on decades of experience building various types of chemical, analytical and biological databases, that the process of building such a knowledgebase can be systematically described and automated. This poster outlines the work performed on text and data mining of various public resources on the Web, the data curation process, and making this information publicly available through a portal and a RESTful API. We will also demonstrate how such a knowledgebase can be used for real-time QSAR and QSPR predictions.


Using publicly available resources
to build a comprehensive knowledgebase of chemical information
by B. Sattarov, R. Zakharov and V. Tkachenko
Science Data Software
Abstract
A variety of public resources on the Internet contain information about various aspects of the chemical, biological and pharmaceutical domains. The quality and maturity of these data resources, and the hosting organizations and team sizes behind them, vary widely. As a consequence, their content cannot always be trusted, and the effort of extracting information and preparing it for reuse is repeated again and again at various levels. This problem is especially serious in applications for QSAR, QSPR and QNAR modeling. The authors of this poster believe, based on decades of experience building various types of chemical, analytical and biological databases, that the process of building such a knowledgebase can be systematically described and automated in a tool for building a comprehensive knowledgebase of chemical information.
We have developed a data mining workflow to collect and standardize chemical data from open sources, using several simple Python scripts that will be included in an open source library. Data collection was carried out by HTML parsing and through the ChemSpider API. To standardize the collected data, we used the Python version of the Chemical Validation and Standardization Platform, which we developed.
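The HTML-parsing side of data collection might look like the following minimal sketch. It uses only the Python standard library. The page markup (a list of `<li class="synonym">` items) and the names `SynonymParser` and `extract_synonyms` are assumptions made for illustration; they do not describe the layout of any particular source or the authors' actual scripts.

```python
from html.parser import HTMLParser


class SynonymParser(HTMLParser):
    """Collect the text of every <li class="synonym"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.synonyms = []
        self._in_synonym = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "synonym") in attrs:
            self._in_synonym = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_synonym:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_synonym:
            text = "".join(self._buffer).strip()
            if text:
                self.synonyms.append(text)
            self._in_synonym = False


def extract_synonyms(html):
    """Return the synonym strings found in an HTML fragment."""
    parser = SynonymParser()
    parser.feed(html)
    parser.close()
    return parser.synonyms


# Hand-written sample page, standing in for a downloaded document.
SAMPLE_PAGE = """
<ul>
  <li class="synonym">Aspirin</li>
  <li class="synonym">Acetylsalicylic acid</li>
  <li class="other">not a synonym</li>
</ul>
"""
```

Calling `extract_synonyms(SAMPLE_PAGE)` returns only the two items marked as synonyms. A real source would need its own tag and attribute checks, since each site structures its pages differently.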
Our ChemScrapper allowed us to resolve 19.85% of the names of biologically active compounds from the MeSH 2017 dataset and to save the results in JSON and the convenient SDF format.
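As a rough illustration of saving resolved compounds in both formats, the sketch below serializes one record to JSON and to an SDF entry. The record layout (`molblock`, `properties`, `synonyms`) is an assumption modeled on the poster's description of its output, and the single-atom mol block is a hand-written placeholder, not data from the MeSH run.

```python
import json

# Hand-written placeholder mol block (a single carbon atom, V2000 layout).
MOLBLOCK = "\n".join([
    "methane",
    "  hand-written placeholder",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

# Hypothetical record layout: mol block, properties and synonyms.
RECORD = {
    "molblock": MOLBLOCK,
    "properties": {"FORMULA": "CH4"},
    "synonyms": ["methane", "marsh gas"],
}


def record_to_sdf(record):
    """Serialize one record as an SDF entry: mol block, data items, '$$$$'."""
    lines = [record["molblock"].rstrip("\n")]
    fields = dict(record.get("properties", {}))
    if record.get("synonyms"):
        # One synonym per line inside a single data item.
        fields["SYNONYMS"] = "\n".join(record["synonyms"])
    for key, value in fields.items():
        lines.append(f"> <{key}>")
        lines.append(str(value))
        lines.append("")  # a blank line ends each data item
    lines.append("$$$$")  # record delimiter
    return "\n".join(lines) + "\n"


def save_records(records, json_path, sdf_path):
    """Write the same records to a JSON file and to an SDF file."""
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)
    with open(sdf_path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(record_to_sdf(record))
```

Keeping the mol block as a plain string in the JSON record means the SDF file can be regenerated from the JSON at any time, without a cheminformatics toolkit.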
The Chemical Validation and Standardization Platform (CVSP), which we used to standardize chemical structures, can also be used as a stand-alone platform for SMIRKS-based standardization of any dataset, thanks to a visual implementation of its Python version's functionality in Jupyter.
You can see every standardization rule applied as a SMIRKS string simply by clicking the SMIRKS button, and you can download the standardized dataset as an *.sdf file by checking the corresponding folder.
Figures: example JSON output with mol block, properties and synonyms; example CLI; example input.
One of the most productive data mining tools we have created works with the ChemSpider web API. It lets a user looking for chemical structures or data work entirely through a convenient command line interface written in Python, in order to resolve chemical names and identifiers or to find new data for QSAR/QSPR analysis or any other purpose that requires such data.
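A command line front end of this kind could be sketched as follows. This is not the authors' tool: the option names, the `resolve_name` function and the small in-memory lookup table are all hypothetical. In a real tool the lookup would be replaced by a call to the ChemSpider web API, which is deliberately not shown here.

```python
import argparse
import json
import sys

# Stand-in lookup table; a real tool would query the ChemSpider web API here.
_DEMO_LOOKUP = {"aspirin": {"smiles": "CC(=O)Oc1ccccc1C(=O)O"}}


def resolve_name(name):
    """Hypothetical resolver: return a record for a name, or None if unresolved."""
    hit = _DEMO_LOOKUP.get(name.strip().lower())
    return None if hit is None else {"query": name, **hit}


def build_parser():
    parser = argparse.ArgumentParser(
        description="Resolve chemical names to structures."
    )
    parser.add_argument("names", nargs="+", help="chemical names or identifiers")
    parser.add_argument(
        "-o", "--output", default="-", help="output JSON file, '-' for stdout"
    )
    return parser


def main(argv=None):
    """Resolve each name, write the hits as JSON, report the count on stderr."""
    args = build_parser().parse_args(argv)
    results = [resolve_name(name) for name in args.names]
    resolved = [record for record in results if record is not None]
    payload = json.dumps(resolved, indent=2)
    if args.output == "-":
        print(payload)
    else:
        with open(args.output, "w", encoding="utf-8") as fh:
            fh.write(payload)
    print(f"resolved {len(resolved)} of {len(results)} names", file=sys.stderr)
    return 0 if resolved else 1
```

A script would typically end with `sys.exit(main())`, and `main(["aspirin", "unknownium"])` can be called directly for testing. Returning a non-zero exit code when nothing is resolved lets the tool be chained in shell pipelines.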
Workflow diagram: data collection (API, HTML parsing), then standardization (CVSP), feeding a comprehensive knowledgebase of chemical information.
Open Science Data Repository (OSDR)
Comprehensive distributed semantic knowledgebase of scientific information
with built-in Machine Learning capabilities
