Innovative Research for a Sustainable Future
www.epa.gov/research
ORCID: 0000-0002-2668-4821
Antony Williams | williams.antony@epa.gov | 919-541-1033
The EPA Online Database of Experimental and Predicted Data to Support Environmental Scientists
Antony Williams1,*, Christopher Grulke1, Kamel Mansouri2, Jennifer Smith1, Jordan Foster1, Michelle Krzyzanowski and Jeff Edwards1
1. U.S. Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology (NCCT), Research Triangle Park, NC
2. Oak Ridge Institute for Science and Education (ORISE) Participant, Research Triangle Park, NC
Problem Definition and Goals
Source Data and Example Errors
Automated Analysis Using KNIME
Accessing ~1 Million Predicted Properties Online
Future Work
As part of our efforts to develop a public platform to provide access to experimental data and predictive models
for physicochemical data we have delivered a web-based database, the EPA Chemistry Dashboard
(https://comptox.epa.gov). In order to develop the prediction models we used publicly available data and a
combination of manual and automated review processes to produce highly curated datasets. The automated
curation processes for the validation of the data used a KNIME workflow and included approaches to validate
different chemical structure representations (e.g. molfile and SMILES), identifiers (chemical names and registry
numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Machine-learning
approaches were applied to create a series of models that have been used to generate predicted
physicochemical and environmental parameters for over 700,000 chemicals. The data have been made available
online via the EPA’s iCSS Chemistry Dashboard and include detailed model reports indicating whether or not a
chemical falls within a model’s applicability domain, access to the nearest neighbors within the
training set, and detailed QMRF (QSAR Model Reporting Format) reports for each of the models.
The iCSS Chemistry Dashboard (ICD) is a public EPA-hosted web application providing access to over 700,000
chemicals from EPA’s DSSTox database5. It integrates experimental and predicted data and is a hub to other
NCCT apps and web-based resources. The 3 and 4 STAR curated experimental data are accessible via the ICD
application. All chemicals were also passed through the prediction models, and the detailed model reports and results
are freely available at http://comptox.epa.gov/dashboard.
The manual investigation of the data allowed us to develop a KNIME2 workflow for automated
processing. This workflow was derived from earlier work by Mansouri et al.3 and is represented in the
figure below as a series of blocks representing, for example:
The KNIME workflow for automated processing of PHYSPROP data
The KNIME workflow was used to insert various levels of Quality Flags indicating consistency between
chemical structure formats and identifiers. The consistency flag definitions and distribution are
summarized below for the >15k chemicals.
Predictive models were developed using only the 3 and 4 star curated data.
• The PHYSPROP data were sourced online1 as SDF files. A total of 13 endpoints were represented, including
LogKow, water solubility, melting point, boiling point, and others. The largest dataset, LogKow, contained over
15,800 individual chemicals. Each data point included the Mol-Block, SMILES, CASRN, Name, LogKow value
and, where available, a reference. This dataset was chosen as representative for examining the quality of
data.
• A manual examination of the data revealed a number of issues: e.g., SMILES and Mol-Block did not agree;
CASRN did not match the correct structure; SMILES could not be converted; a single chemical structure
would be listed multiple times with different property values. Example errors across various PHYSPROP
datasets are shown below.
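One of the issues listed above, a single structure appearing several times with different property values, can be screened for with a simple grouping step. The following is a minimal sketch and not the poster's workflow: the structure keys, the tolerance, and the sample records are hypothetical, and the keys are assumed to come from an upstream structure standardization step.

```python
from collections import defaultdict


def find_conflicting_duplicates(records, tolerance=0.0):
    """Return structures listed more than once whose values span more than `tolerance`.

    `records` is an iterable of (structure_key, value) pairs. The structure key
    is assumed to be a standardized identifier produced elsewhere.
    """
    values_by_structure = defaultdict(list)
    for structure_key, value in records:
        values_by_structure[structure_key].append(value)

    conflicts = {}
    for structure_key, values in values_by_structure.items():
        # Only repeated structures with a value spread above the tolerance are flagged.
        if len(values) > 1 and max(values) - min(values) > tolerance:
            conflicts[structure_key] = sorted(values)
    return conflicts


# Hypothetical melting point records in degrees Celsius.
records = [
    ("STRUCT-A", -4.0),
    ("STRUCT-A", 70.0),
    ("STRUCT-B", 25.0),
    ("STRUCT-B", 25.0),
    ("STRUCT-C", 110.5),
]
conflicts = find_conflicting_duplicates(records, tolerance=1.0)
print(conflicts)
```

Exact duplicates such as STRUCT-B are not flagged here; a real pipeline would still need to collapse them before modeling.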
References
1. PHYSPROP Data: http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm
2. KNIME: https://www.knime.org/
3. Mansouri et al. CERAPP: Collaborative Estrogen Receptor Activity Prediction Project, Environ Health Perspect; DOI:10.1289/ehp.1510267
4. PaDEL descriptors, http://padel.nus.edu.sg/software/padeldescriptor/
5. EPA Distributed Structure-Searchable Toxicity (DSSTox) Database, http://www.epa.gov/chemical-research/distributed-structure-searchable-toxicity-dsstox-database
6. EPA Toxicity Estimation Software Tool (T.E.S.T.) software, http://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test
Preparing data for QSAR Modeling
• Release NCCT models as interactive online prediction tools in the near future via the ICD.
• Integrate the suite of EPA T.E.S.T6 physchem and toxicity prediction models to expand the collection of
available models.
• Source additional data to expand the training sets underpinning the prediction algorithms.
4 STAR ENHANCED: Name/CASRN/Mol/SMILES added Stereo: 550
4 STAR: 4 of 4 Name/CASRN/Mol/SMILES: 5967
3 STAR ENHANCED: 3 of 4 Name/CASRN/Mol/SMILES added Stereo: 177
3 STAR: 3 of 4 Name/CASRN/Mol/SMILES: 7910
2 STAR PLUS: 2 of 4 Name/CASRN/Mol/SMILES/Tautomer: 133
2 STAR: 2 of 4 Name/CASRN/Mol/SMILES: 1003
1 STAR: No two fields consistent: 379
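The flag definitions above amount to counting how many of the four fields (Name, CASRN, Mol-Block, SMILES) agree on one structure. The sketch below illustrates that idea only. It assumes each field has already been resolved to a comparable structure key (a hypothetical upstream step), it ignores the ENHANCED and PLUS tiers for stereochemistry and tautomers, and the function name is invented for illustration. The counts in `flag_counts` are copied from the list above.

```python
from collections import Counter


def star_rating(name_key, casrn_key, mol_key, smiles_key):
    """Map the largest group of agreeing fields to a 1-4 star flag.

    Each argument is the structure key resolved from that field, or None
    when the field could not be resolved.
    """
    keys = [k for k in (name_key, casrn_key, mol_key, smiles_key) if k is not None]
    if not keys:
        return 1
    largest_agreement = Counter(keys).most_common(1)[0][1]
    # Fewer than two agreeing fields means no two fields are consistent.
    return largest_agreement if largest_agreement >= 2 else 1


# Distribution reported on the poster.
flag_counts = {
    "4 STAR ENHANCED": 550,
    "4 STAR": 5967,
    "3 STAR ENHANCED": 177,
    "3 STAR": 7910,
    "2 STAR PLUS": 133,
    "2 STAR": 1003,
    "1 STAR": 379,
}
total = sum(flag_counts.values())
# Only the 3 and 4 star tiers were used for model development.
modeled = sum(n for flag, n in flag_counts.items() if flag.startswith(("3", "4")))
print(total, modeled, round(modeled / total, 3))
```

The totals give a quick consistency check against the ">15k chemicals" and "more than 14k chemicals" figures quoted elsewhere on the poster.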
2627/P125
American Chemical Society Fall Meeting
Philadelphia, PA.
August 21-25, 2016
• Check names against dictionary (555 invalid)
• Assign Quality flags based on consistency
among data fields
Examples of hypervalency in LogKow dataset
Equivalent structures but with different CASRN, names and
values in MP dataset: -4 and +70 °C
Equivalent structures, different CASRN, names, and
values in MP dataset
This poster does not necessarily reflect EPA policy. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
For the purposes of QSAR modeling, the 3 and 4 STAR datasets were
processed through a KNIME workflow. This processing removed
inorganics and mixtures, processed salts into neutral forms (except for
melting point data), normalized tautomers, and removed duplicates.
The resulting “QSAR-ready” file(s) were modeled using Genetic
Algorithm-Partial Least Squares with 5-fold cross validation and
utilizing 2D PaDEL4 molecular descriptors. Multiple modeling runs
(100) produced the best models using a minimum number of
descriptors. The models for all 13 endpoints are available as both
Windows and Linux executable binaries and as a C++ library that can
be called by a separate application.
Model Performance
The curation of the available data, the use of large datasets for training and validation, and the application of
innovative machine-learning approaches produced high-performance yet simple models with a minimum number
of descriptors. The logP model was built on a dataset of more than 14k chemicals using only 10 descriptors. With
a wider applicability domain, this model outperformed EPI Suite’s predictions (left-side plot), especially for the
over-estimated cluster, which was extracted and plotted together with the new model’s predictions (right-side plot).
Different structures in Mol-Block and SMILES
Statistics of the new model
5-fold cross-validation:
Q2: 0.87 RMSE: 0.67
Fitting 11247:
R2: 0.87 RMSE: 0.66
Test set prediction:
R2: 0.84 RMSE: 0.65
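For readers unfamiliar with the statistics above, the sketch below shows how a cross-validated Q2 and RMSE are commonly computed: Q2 = 1 - PRESS / SS_tot, where PRESS sums the squared errors of out-of-fold predictions. This is an illustration under stated assumptions and not the poster's method. A single-descriptor least-squares line stands in for the Genetic Algorithm-Partial Least Squares models, the fold assignment is a simple interleave, and the data are synthetic.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one descriptor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k=5):
    """Return (Q2, RMSE) from k-fold cross-validation with interleaved folds."""
    n = len(xs)
    predictions = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            # Each point is predicted by a model that never saw it.
            predictions[i] = slope * xs[i] + intercept
    press = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot, math.sqrt(press / n)


# Synthetic descriptor and property values with a deterministic perturbation.
xs = [float(i) for i in range(20)]
ys = [0.5 * x + 1.0 + 0.2 * ((i * 3) % 7 - 3) for i, x in enumerate(xs)]
q2, rmse = cross_validate(xs, ys, k=5)
print(f"Q2 = {q2:.2f}, RMSE = {rmse:.2f}")
```

The fitting R2 reported above uses the same formula, but with predictions from a model trained on all of the fitting data rather than out-of-fold predictions.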
• Compare Mol-Block and SMILES (2268 different)
• Check for duplicates (657 structures, 531 names)
• Check CASRN Numbers (3646 invalid CASRN)
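A CASRN can be screened for validity using its check digit: the final digit must equal the sum of the preceding digits, each weighted by its position counted from the right, modulo 10. The sketch below shows that check in isolation. It is not taken from the KNIME workflow, the function name is illustrative, and passing the check says nothing about whether the CASRN belongs to the listed structure.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn):
    """Return True when the string is well formed and its check digit matches."""
    match = CASRN_PATTERN.match(casrn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the rightmost digit by 1, the next by 2, and so on.
    weighted_sum = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return weighted_sum % 10 == check_digit


for candidate in ("80-05-7", "7732-18-5", "80-05-8", "not-a-casrn"):
    print(candidate, is_valid_casrn(candidate))
```

Here 80-05-7 is the CASRN of Bisphenol A, the chemical used in the dashboard examples on this poster, and 80-05-8 is the same number with a deliberately wrong check digit.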
Abstract
Problem: There is limited online access to freely available, high-quality experimental and predicted data associated with
chemical structures and their associated substances.
Goals: To deliver access to curated datasets of experimental and physicochemical data associated with
chemicals of interest to environmental scientists. To make the data available as downloadable Open data. To
develop predictive models from the curated data, use them to predict properties for >700,000 chemicals, and make
the predictions available online. To provide details regarding the performance of the models.
A summary of available chemical properties
for Bisphenol A
The model report for the melting point prediction
for Bisphenol A – including nearest neighbors.
EPI Suite’s predictions.
EPI Suite’s (red) vs. NCCT model’s predictions (black)

STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 

Dernier (20)

Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 

The EPA Online Prediction Physicochemical Prediction Platform to Support Environmental Scientists

Accessing ~1 Million Predicted Properties Online

The iCSS Chemistry Dashboard (ICD) is a public, EPA-hosted web application providing access to over 700,000 chemicals from EPA’s DSSTox database [5]. It integrates experimental and predicted data and serves as a hub to other NCCT applications and web-based resources. The 3 and 4 STAR curated experimental data are accessible via the ICD application. All chemicals were also passed through the prediction models, and detailed model reports and results are freely available at http://comptox.epa.gov/dashboard.

Automated Analysis Using KNIME

The manual investigation of the data allowed us to develop a KNIME [2] workflow for automated processing. This workflow was derived from earlier work by Mansouri et al. [3] and is represented in the figure below as a series of processing blocks.

[Figure: The KNIME workflow for automated processing of PHYSPROP data]

The KNIME workflow was used to insert various levels of Quality Flags indicating consistency between chemical structure formats and identifiers. The consistency flag definitions and distribution are summarized below for the >15k chemicals. Predictive models were developed using only the 3 and 4 STAR curated data.

Source Data and Example Errors

The PHYSPROP data were sourced online [1] as SDF files. A total of 13 endpoints were represented, including LogKow, water solubility, melting point, boiling point, and others. The largest dataset, LogKow, contained over 15,800 individual chemicals. Each data point included the Mol-Block, SMILES, CASRN, Name, LogKow value and, where available, a reference. This dataset was chosen as representative for examining the quality of the data.
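As an illustration of how the fields of each SDF data point can be pulled apart before any checks are run, here is a minimal sketch in plain Python. It relies only on the general SDF layout (a Mol-Block ending in "M  END", data headers of the form "> <NAME>", and "$$$$" closing each record). The field names CAS and SMILES and the one-atom sample record are assumptions made for the example; the actual tag names in the PHYSPROP files may differ.

```python
from typing import Dict, Iterator, List


def parse_record(lines: List[str]) -> Dict[str, str]:
    """Split one SDF record into its Mol-Block and its named data fields."""
    # The Mol-Block runs up to and including the "M  END" line.
    end = next((i for i, ln in enumerate(lines) if ln.startswith("M  END")), -1)
    record = {"MolBlock": "\n".join(lines[: end + 1])}
    field, values = None, []
    for ln in lines[end + 1:]:
        if ln.startswith(">"):
            # Data header such as "> <CAS>": the field name sits inside <...>.
            field, values = ln[ln.find("<") + 1 : ln.rfind(">")], []
        elif field is not None and ln.strip() == "":
            record[field] = "\n".join(values)  # a blank line closes the field
            field = None
        elif field is not None:
            values.append(ln)
    if field is not None:  # tolerate a final field with no closing blank line
        record[field] = "\n".join(values)
    return record


def iter_sdf_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield one parsed record per "$$$$"-terminated block of an SDF file."""
    block: List[str] = []
    for ln in text.splitlines():
        if ln.strip() == "$$$$":
            yield parse_record(block)
            block = []
        else:
            block.append(ln)


# Hand-written sample record (hypothetical tag names, not PHYSPROP data).
SAMPLE = "\n".join([
    "methane",
    "  hand-written example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <CAS>",
    "74-82-8",
    "",
    "> <SMILES>",
    "C",
    "",
    "$$$$",
])
records = list(iter_sdf_records(SAMPLE))
```

Keeping the Mol-Block and the SMILES as separate entries of the same record is what makes a later consistency comparison between the two representations possible.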
A manual examination of the data revealed a number of issues: SMILES and Mol-Block did not agree; CASRN did not match the correct structure; SMILES could not be converted; and a single chemical structure was listed multiple times with different property values. Example errors across various PHYSPROP datasets are shown below.

[Figure: Examples of hypervalency in the LogKow dataset]
[Figure: Equivalent structures with different CASRN, names, and values in the MP dataset: -4 and +70 °C]
[Figure: Equivalent structures, different CASRN, names, and values in the MP dataset]

Automated checks in the KNIME workflow:
• Check names against dictionary (555 invalid)
• Assign Quality Flags based on consistency among data fields

Quality flag definitions and distribution (number of chemicals):
• 4 STAR ENHANCED: Name/CASRN/Mol/SMILES, added Stereo: 550
• 4 STAR: 4 of 4 Name/CASRN/Mol/SMILES: 5967
• 3 STAR ENHANCED: 3 of 4 Name/CASRN/Mol/SMILES, added Stereo: 177
• 3 STAR: 3 of 4 Name/CASRN/Mol/SMILES: 7910
• 2 STAR PLUS: 2 of 4 Name/CASRN/Mol/SMILES/Tautomer: 133
• 2 STAR: 2 of 4 Name/CASRN/Mol/SMILES: 1003
• 1 STAR: No two fields consistent: 379

Future Work
• Release NCCT models as interactive online prediction tools in the near future via the ICD.
• Integrate the suite of EPA T.E.S.T. [6] physchem and toxicity prediction models to expand the collection of available models.
• Source additional data to expand the training sets underpinning the prediction algorithms.

References
1. PHYSPROP Data: http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm
2. KNIME: https://www.knime.org/
3. Mansouri et al. CERAPP: Collaborative Estrogen Receptor Activity Prediction Project, Environ Health Perspect; DOI:10.1289/ehp.1510267
4. PaDEL descriptors: http://padel.nus.edu.sg/software/padeldescriptor/
5. EPA Distributed Structure-Searchable Toxicity (DSSTox) Database: http://www.epa.gov/chemical-research/distributed-structure-searchable-toxicity-dsstox-database
6. EPA Toxicity Estimation Software Tool (T.E.S.T.) software: http://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test

2627/P125, American Chemical Society Fall Meeting, Philadelphia, PA, August 21-25, 2016
This poster does not necessarily reflect EPA policy. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.

Preparing data for QSAR Modeling

For the purposes of QSAR modeling, the 3 and 4 STAR datasets were processed through a KNIME workflow. This processing removed inorganics and mixtures, processed salts into neutral forms (except for melting point data), normalized tautomers, and removed duplicates. The resulting “QSAR-ready” files were modeled using Genetic Algorithm-Partial Least Squares with 5-fold cross-validation and 2D PaDEL [4] molecular descriptors. Multiple modeling runs (100) produced the best models using a minimum number of descriptors. The models for all 13 endpoints are available as both Windows and Linux executable binaries and as a C++ library that can be called by a separate application.

Model Performance

The curation of the available data, the use of large datasets for training and validation, and the application of innovative machine-learning approaches produced high-performance yet simple models with a minimum number of descriptors. The logP model was built on a dataset of more than 14k chemicals using only 10 descriptors. With a wider applicability domain, this model outperformed EPI Suite’s predictions (left-side plot), especially for the over-estimated cluster, which was extracted and plotted together with the new model’s predictions (right-side plot).
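To make the cross-validated statistics concrete, the sketch below shows one common way to compute Q2 and RMSE from k-fold cross-validation. It is not the poster's GA-PLS model: a single-descriptor least-squares line stands in for the real model, the data are synthetic, and Q2 is taken as 1 - PRESS / SS_total with the overall mean of y, which is only one of several conventions in use.

```python
import math
import random
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a + b*x (a stand-in for the real model)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def cross_validate(x: Sequence[float], y: Sequence[float],
                   k: int = 5, seed: int = 0) -> Tuple[float, float]:
    """Return (Q2, RMSE) from k-fold cross-validation."""
    idx: List[int] = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all rows
    press = 0.0  # sum of squared errors on held-out rows
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        press += sum((y[i] - (a + b * x[i])) ** 2 for i in held_out)
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_total, math.sqrt(press / len(y))


# Synthetic data: an exact line, then the same line with alternating noise.
xs = [float(i) for i in range(20)]
q2_exact, rmse_exact = cross_validate(xs, [2.0 * v + 1.0 for v in xs])
q2_noisy, rmse_noisy = cross_validate(
    xs, [2.0 * v + 1.0 + (0.5 if i % 2 else -0.5) for i, v in enumerate(xs)]
)
```

Every chemical is predicted exactly once by a model that never saw it during fitting, which is why Q2 is a stricter figure than the fitting R2.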
[Figures: EPI Suite’s predictions (left); EPI Suite’s (red) vs. NCCT model’s (black) predictions (right)]

Statistics of the new model:
• 5-fold cross-validation: Q2 = 0.87, RMSE = 0.67
• Fitting (11247): R2 = 0.87, RMSE = 0.66
• Test set prediction: R2 = 0.84, RMSE = 0.65

[Figure: A summary of available chemical properties for Bisphenol A]
[Figure: The model report for the melting point prediction for Bisphenol A, including nearest neighbors]

Problem Definition and Goals

Problem: Online access to freely available, high-quality experimental and predicted data associated with chemical structures and their associated substances is limited.

Goals:
• To deliver access to curated datasets of experimental and physicochemical data associated with chemicals of interest to environmental scientists.
• To make the data available as downloadable Open data.
• To develop predictive models from the curated data, use them to predict properties for >700,000 chemicals, and make the predictions available online.
• To provide details regarding the performance of the models.

Further automated checks in the KNIME workflow:
• Compare Mol-Block and SMILES (2268 different)
• Check for duplicates (657 structures, 531 names)
• Check CASRN Numbers (3646 invalid CASRN)

[Figure: Different structures in Mol-Block and SMILES]
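One check that can be run on a CASRN without any other data is the standard CAS check-digit test, sketched below. Whether the KNIME workflow counted invalid CASRNs in exactly this way is an assumption. A CASRN that passes the checksum is only well formed; confirming that it belongs to the listed structure still requires a lookup against a trusted registry.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
_CASRN_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string has CASRN format and a correct check digit."""
    match = _CASRN_PATTERN.fullmatch(casrn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one;
    # the check digit is the weighted sum modulo 10.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


bisphenol_a_ok = is_valid_casrn("80-05-7")    # Bisphenol A
water_ok = is_valid_casrn("7732-18-5")        # water
wrong_check_digit = is_valid_casrn("80-05-8")
not_a_casrn = is_valid_casrn("bisphenol A")
```

For 80-05-7 the weighted sum is 5×1 + 0×2 + 0×3 + 8×4 = 37, and 37 mod 10 gives the check digit 7.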