The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program utilizes computational and data-driven approaches that integrate chemistry, exposure and biological data to help characterize potential risks from chemical exposure. The National Center for Computational Toxicology (NCCT) has measured, assembled and delivered an enormous quantity and diversity of data for the environmental sciences, including high-throughput in vitro screening data, in vivo and functional use data, exposure models and chemical databases with associated properties. The CompTox Chemicals Dashboard website provides access to data associated with ~900,000 chemical substances. New data are added on an ongoing basis, including the registration of new and emerging chemicals, data extracted from the literature, chemicals studied in our labs, and data of interest to specific research projects at the EPA. Hazard and exposure data have been assembled from a large number of public databases and as a result the dashboard surfaces hundreds of thousands of data points. Other data include experimental and predicted physicochemical property data, in vitro bioassay data for over 4000 chemicals and ~1500 assays, and millions of chemical identifiers (names and CAS Registry Numbers) to facilitate searching. Other integrated modules include an interactive read-across module, real-time physicochemical and toxicity endpoint prediction and an integrated search to PubMed. This presentation will provide an overview of the CompTox Chemicals Dashboard and how it has developed into an integrated data hub for environmental data. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
US-EPA Chemicals Dashboard – an integrated data hub for environmental science
1. US-EPA Chemicals Dashboard – an integrated data hub for environmental science
West Coast Metabolomics Webinar,
September 30th 2020
http://www.orcid.org/0000-0002-2668-4821
The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
Antony Williams
Center for Computational Toxicology and Exposure, US-EPA, RTP, NC
…and an enormous cast of characters
2. CompTox Chemicals Dashboard
• A publicly accessible website delivering access:
– ~882,000 chemicals with related property data
– Experimental and predicted physicochemical property data
– Experimental Human and Ecological hazard data
– Integration to “biological assay data” for 1000s of chemicals
– Information regarding consumer products containing chemicals
– Links to other agency websites and public data resources
– “Literature” searches for chemicals using public resources
– “Batch searching” for thousands of chemicals
24. Building a “reference” PFAS list
• PFAS structure list (PFASSTRUCT)
is expanded from public databases, EPA
agency lists and literature
• Approaching ~7000 structures – 98.8% have
associated CAS Numbers
• Compare with PubChem 220,720 structures
30. How many chemicals to consider?
• Dashboard content is small but focused and curated
• Dashboard ~900,000
• ChemSpider ~90,000,000
• PubChem ~111,000,000
• CAS ~ 165,000,000
31. BIG databases are GREAT!
[Figure: chart of the number of chemical substances (log scale, 10^4 to 10^9) in PubChem, CAS Registry, ChemSpider, EPA DSSTox and Blood Exposome]
• Thanks to all of the public database efforts
• So much benefit from what’s been done
• There are hundreds of them at this point…
32. What chemicals constitute the Exposome?
• What constitutes the exposome is, of
course, difficult to answer.
• Focused chemical list expansion highlights
chemical classes of interest – chemicals in
commerce, pesticides, cosmetics, PFAS
35. PubChem – “virtual chemistry”
• Other databases grow quickly…a lot of “virtual
chemistry” and “make on demand” compounds.
Vomitoxin has 7 ZINC stereoforms.
• The Dashboard database grows slowly (next
release is +20k chemicals in 6 months)
36. ChemSpider – lots of virtuals???
• 52 million chemicals from one vendor
46. Overview of MS-Ready Structures
• All structure-based chemical substances are
algorithmically processed to
– Split multicomponent chemicals into individual structures
– Desalt and neutralize individual structures
– Remove stereochemical bonds from all chemicals
• MS-Ready structures are then mapped to
original substances to provide a path between
chemicals detected by mass spectrometry to
original substances
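The processing steps above can be illustrated with a toy, string-level sketch. It is not the Dashboard's actual workflow: the counter-ion list, the function names and the example substance labels are all hypothetical, charge neutralization is deliberately left out, and a real implementation would use a cheminformatics toolkit rather than string replacement on SMILES.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny counter-ion list used for "desalting".
COUNTER_IONS = {"Cl", "[Cl-]", "Br", "[Br-]", "[Na+]", "[K+]"}


def strip_stereo(smiles: str) -> str:
    """Drop SMILES stereo markers: '@' (tetrahedral), '/' and '\\' (double bond)."""
    return smiles.replace("@", "").replace("/", "").replace("\\", "")


def ms_ready_forms(smiles: str) -> set[str]:
    """Split a multicomponent SMILES on '.', drop listed counter-ions,
    and remove stereo markers from what remains.
    Neutralization of charged components is NOT attempted here."""
    components = smiles.split(".")
    kept = [c for c in components if c not in COUNTER_IONS]
    return {strip_stereo(c) for c in kept}


def build_mapping(substances: dict[str, str]) -> dict[str, list[str]]:
    """Map each simplified structure back to the original substances."""
    mapping = defaultdict(list)
    for name, smiles in substances.items():
        for form in ms_ready_forms(smiles):
            mapping[form].append(name)
    return dict(mapping)


# Illustrative input: two stereoforms and one hydrochloride salt of alanine.
substances = {
    "alanine stereoform 1": "C[C@H](N)C(=O)O",
    "alanine stereoform 2": "C[C@@H](N)C(=O)O",
    "alanine stereoform 1 hydrochloride": "C[C@H](N)C(=O)O.Cl",
}
mapping = build_mapping(substances)
for form, names in mapping.items():
    print(form, "->", names)
```

All three inputs collapse to the same simplified structure, which is the behaviour the mapping relies on: one structure detected by mass spectrometry leads back to every original substance that contains it.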
56. MS-Ready Mappings
• 125 chemicals returned in total
– 8 of the 125 are single component chemicals
– 3 of the 8 are isotope-labeled
– 3 are neutral compounds and 2 are charged
• Multiple components, stereo, isotopes and
charge all collapsed and mapped through
MS-Ready
58. Batch Searching
• Singleton searches are useful but we work
with thousands of masses and formulae!
• Typical questions
– What is the list of chemicals for the formula CxHyOz
– What is the list of chemicals for a mass +/- error
– Can I get chemical lists in Excel files? In SDF files?
– Can I include properties in the download file?
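As a rough illustration of the "mass +/- error" question, the sketch below computes monoisotopic masses for a few formulae and returns those that fall within a ppm window of each observed mass. It is a local toy, not the Dashboard's batch-search interface: the function names, the three-formula reference list and the 5 ppm default are assumptions, and only C, H, N and O are supported.

```python
import re

# Monoisotopic masses of the most abundant isotopes (C, H, N, O only).
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
}


def monoisotopic_mass(formula: str) -> float:
    """Sum element masses for a simple formula such as 'C14H22N2O3'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def within_ppm(observed: float, theoretical: float, ppm: float) -> bool:
    """True if the observed mass lies within +/- ppm of the theoretical mass."""
    return abs(observed - theoretical) <= theoretical * ppm / 1e6


def batch_search(masses, formulas, ppm=5.0):
    """For each observed mass, list the reference formulae inside the window."""
    return {
        m: [f for f in formulas if within_ppm(m, monoisotopic_mass(f), ppm)]
        for m in masses
    }


# Hypothetical reference list and observed masses.
reference_formulas = ["C14H22N2O3", "C6H12O6", "C8H10N4O2"]
hits = batch_search([266.16304, 180.0634, 300.0], reference_formulas)
for mass, matches in hits.items():
    print(mass, matches)
```

In practice each formula would expand to many candidate structures, which is why the downloadable lists and the ranking metadata matter.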
62. Benefits of bringing it all together
• The true dashboard benefit is integration
• Rank potential candidates for toxicity using
available data – hazard, exposure, in vitro
68. Data Source Ranking of “known unknowns”
• A mass and/or formula search targets a chemical that is unknown to the analyst but is a known chemical contained within a reference database
• The most likely candidate chemicals are those with the most associated data sources, the most associated literature articles, or both
[Figure: a formula (C14H22N2O3) or mass (266.16304) is searched against a chemical reference database, which returns sorted candidate structures]
69. Data Streams for Ranking
• CompTox Dashboard Data Sources
• PubChem Data Source Count
• PubMed Reference Count
• Toxcast in vitro bioactivity
• Presence in CPDat database
• OPERA PhysChem Properties
• Other possibilities – predicted media
occurrence, frequency of InChIs online
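A minimal sketch of metadata-based ranking, assuming only two of the data streams listed above: candidates are sorted by data source count, with PubMed reference count as a tie-breaker. The `Candidate` class, the tie-break rule and all counts are illustrative assumptions, not the scoring actually used in the Dashboard.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    data_sources: int  # number of data sources listing the chemical
    pubmed_refs: int   # number of associated PubMed articles


def rank_candidates(candidates):
    """Most data sources first; PubMed count breaks ties (assumed rule)."""
    return sorted(
        candidates,
        key=lambda c: (c.data_sources, c.pubmed_refs),
        reverse=True,
    )


# Hypothetical candidates sharing one formula; the counts are made up.
candidates = [
    Candidate("candidate A", data_sources=3, pubmed_refs=10),
    Candidate("candidate B", data_sources=41, pubmed_refs=250),
    Candidate("candidate C", data_sources=41, pubmed_refs=900),
    Candidate("candidate D", data_sources=0, pubmed_refs=0),
]
ranked = rank_candidates(candidates)
for position, cand in enumerate(ranked, start=1):
    print(position, cand.name, cand.data_sources, cand.pubmed_refs)
```

Further streams (bioactivity, product occurrence, predicted properties) could be added as extra fields, but how to weight them against each other is a separate design question that this sketch does not address.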
73. Is a bigger database better?
• ChemSpider was 26 million chemicals for
the original work
• Much BIGGER today
• Is bigger better??
• Are there other metadata to use for ranking?
74. Comparing Search Performance
• When dashboard contained 720k chemicals
• Only 3% of ChemSpider size
• What was the comparison in performance?
76. How did performance compare?
For the same 162 chemicals,
Dashboard outperforms
ChemSpider for both Mass and
Formula Ranking
77. Identification ranks for 1783 chemicals using multiple data streams
[Figure legend: DS = Data Sources; PC = PubChem; PM = PubMed; STOFF = DB; KEMI = DB]
• Data Sources alone rank ~75% of the chemicals as Top Hit
80. UVCBs challenge in non-target analysis
[Figure: homologue screening plots from Swiss wastewater (Schymanski et al. 2014, left) and Novi Sad (right)]
• Complex mixtures (UVCBs) are a huge and very challenging part of the unknowns in many environmental samples
90. List Registration Activities
• Registering and curating numerous lists
– NIST library of chemicals – clean-up, especially around
stereochemical representation
– United States Geological Survey chemicals in water
– Scientific Working Group for the Analysis of Seized Drugs
– Synthetic Cannabinoids
– Blood Exposome Database
91. Blood Exposome Curation
• Blood exposome data collection from Barupal and Fiehn. Great work, and we are reviewing it.
• Aggregating large datasets is CHALLENGING
• Comparing with our “Abstract Sifter” approach
• A LIMITED curated form is available online:
https://comptox.epa.gov/dashboard/chemical_lists/BLOODEXPOSOME (19867 chemicals)
92. Prototype Work in Progress
• CFM-ID
– Viewing and Downloading pre-predicted spectra
– Search spectra against the database
• Hazard Comparison Dashboard:
https://hazard.sciencedataexperts.com/
• Structure/substructure/similarity search
• Access to API and web services
• Integration to EPA “Chemical Transformation
Simulator”
100. CASMI 2012-2017 revisited
• Application of metadata candidate ranking
and CFM-ID to all five years of CASMI data
101. Method Amenability Prediction
Charlie Lowe
Why?
• Chromatography-mass
spectrometry can be LC or GC
• Which phase is more appropriate
for which chemicals?
102. Ongoing Work
• Data sources to date
• MassBank of North America
– 9,275 chemicals for non-derivatized GC
– 846 chemicals for derivatized GC
– 816 chemicals for APCI+
– 454 chemicals for APCI-
– 4,907 chemicals for ESI+
– 3,430 chemicals for ESI-
• EPA Non-targeted Analysis Collaborative Trial (ENTACT)
– 886 chemicals for non-derivatized GC
– 44 chemicals for derivatized GC
– 774 chemicals for APCI+
– 431 chemicals for APCI-
– 1,113 chemicals for ESI+
– 648 chemicals for ESI-
112. You want to know more…
• Lots of resources available
– Presentations: https://tinyurl.com/w5hqs55
– Communities of Practice Videos: https://rb.gy/qsbno1
– Manual: https://rb.gy/4fgydc
– Latest News: https://comptox.epa.gov/dashboard/news_info
117. Conclusion
• Dashboard access to data for ~882,000 chemicals
• MS-Ready data facilitates structure identification
• Related metadata facilitates candidate ranking
• Relationship mappings and
chemical lists of great utility
• Curation and mutual
sharing of chemical lists is
important (e.g. NORMAN)
119. A request for the audience
• Please submit comments for curation
120. Please share data and lists
• Help expand existing lists with new data
• Consider using DTXSIDs instead of just
Names/CASRNs in your published tables
121. Acknowledgements
• ILS: Kamel Mansouri
• EPA ORD: Ann Richard, Chris Grulke, John Wambaugh, Jeremy Dunne, Jeff Edwards, Grace Patlewicz, Alex Chao, Kristin Isaacs, Charles Lowe, James McCord, Seth Newton, Katherine Phillips, Tom Purucker, Jon Sobus, Mark Strynar, Elin Ulrich, Joach Pleil
• GDIT: Ilya Balabin, Tom Transue, Tommy Cathey
• Teams: IT Development Team, Curation Team
• Collaborators: Emma Schymanski, NORMAN Network, Andrew McEachran, Jerry Zweigenbaum
122. Contact
Antony Williams
CCTE, US EPA Office of Research and Development,
Williams.Antony@epa.gov
ORCID: https://orcid.org/0000-0002-2668-4821