Environmental Cheminformatics to Identify Unknown Chemicals and their Effects
Assoc. Prof. Dr. Emma L. Schymanski
FNR ATTRACT Fellow and PI: Environmental Cheminformatics, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 avenue du Swing, L-4367 Belvaux, Luxembourg.
The Environmental Cheminformatics group at the Luxembourg Centre for Systems Biomedicine focuses on the comprehensive identification of known and unknown chemicals in our environment to investigate their effects on health and disease. The environment and the chemicals to which we are exposed are incredibly complex, with over 125 million chemicals registered in the largest chemical registry and over 70,000 in household use alone. Detectable molecules in complex samples can now be captured using high resolution mass spectrometry (HRMS), which provides a “snapshot” of all chemicals present in a sample and allows for retrospective data analysis through digital archiving. However, scientists cannot yet identify the vast majority of the tens of thousands of features in each sample, leading to critical bottlenecks in identification and data interpretation. For instance, recent studies indicate a strong connection between the gut microbiome and Parkinson’s disease, yet over 60 % of significant metabolites in microbiome experiments are unknown. Unknown identification remains extremely time-consuming and, in many cases, a matter of luck. Prioritizing efforts to find the significant metabolites or potentially toxic substances responsible for observed effects is key; this involves reconciling highly complex samples with expert knowledge and careful validation. This talk will cover European, US and worldwide community initiatives to help connect knowledge on chemistry and toxicity with environmental observations - from compound databases to spectral libraries and retrospective screening. It will touch on the challenges of standardized structure representations, data curation, deposition and communication between resources. Finally, it will show how interdisciplinary efforts and data sharing can facilitate research in metabolomics, exposomics and beyond.
NOTE: some slides causing errors have been removed but can be accessed through the tinyurl on the front page.
Active hyperlinks can be retrieved using the tinyurl on the front page. Please cite this work if you use any of the contents.
2. 2
Outline for Today
o Background about LCSB
• LCSB & University of Luxembourg
• Biomedicine and Parkinson’s Disease
• Environmental Cheminformatics @LCSB
o European(+) Community Efforts for Unknown ID
• Mass Spectral Libraries (www.massbank.eu)
• NORMAN Suspect Exchange and CompTox Chemicals Dashboard
• Metadata, MS-ready and MetFrag
• Bigger Picture Examples (Rhine, NormanNEWS, DSFP)
o Work in Progress and Future Challenges
• Complex Mixtures – Cheminformatics to Screen Undefined Structures
• Preview: Disease-specific & MetFrag-compatible Metadata
• Bonus slides on HDX (an entire presentation) if anyone wants
4. 4
University of Luxembourg & LCSB
o Uni Lu was founded in 2003
• We just turned 15 (teenage years!)
o LCSB was founded in 2009
• …and is still a pre-teenager
• Young and very dynamic working environment!
5. 5
Environmental Cheminformatics … the Group
Sources: S. Gene; https://en.wikipedia.org/wiki/File:Zwei_zigaretten.jpg; R. Singh; DOI: 10.1186/s13321-017-0223-1; DOI: 10.1016/j.aca.2017.12.034
6. 6
Our challenge? We still have many unknowns …
o …in both environmental and metabolomics analysis
(l) Data from Schymanski et al 2014, ES&T DOI: 10.1021/es4044374. (r) E. coli data provided by N. Zamboni, IMSB, ETH Zürich.
Wastewater
Cells
7. 7
(European) Environmental Community (subset!)
Schymanski et al. 2015, ABC, DOI: 10.1007/s00216-015-8681-7
Croatian
Water
RWS
Specialist Knowledge
Highly Disjointed
8. 8
[Figure: scale of chemical numbers: 1, 10, 100, 1000, 10000, 100000, 1 million, 1 billion]
Our (Community) Challenge: Identifying Chemicals
Data: Schymanski et al 2014, Environ. Sci. Technol. DOI: 10.1021/es4044374; Hollender et al 2017 DOI: 10.1021/acs.est.7b02184
Sample
High resolution mass spectrometry
9. 9
[Figure: scale of chemical numbers: 1, 10, 100, 1000, 10000, 100000, 1 million, 1 billion]
Our (Community) Challenge: Identifying Chemicals
Data: Schymanski et al 2014, DOI: 10.1021/es4044374; https://www.slideshare.net/EmmaSchymanski/small-molecules-in-big-data-analytica-munich
Sample
High resolution mass spectrometry
Chemicals
AND connecting chemical knowledge
11. 11
MassBank EU
https://github.com/MassBank/MassBank-data; https://github.com/MassBank/MassBank-web/; Rösch et al DOI 10.1021/acs.est.5b05186
http://massbank.eu/MassBank
o MassBank.EU was founded late 2012, hosted at UFZ, Leipzig, Germany
o >16,000 MS/MS spectra; 1,200 substances from NORMAN members
o MassBank now has >46,000 spectra from 32 contributing institutes!
o Thorough GitHub-based modernization in progress for traceability
o Tentative/unknown/literature spectra (Level Scheme) as SI for publications
Schymanski et al DOI: 10.1021/es5002105
13. 13
Confidence Levels for Tentative Structures
Schymanski, Jeon, Gulde, Fenner, Ruff, Singer & Hollender (2014) ES&T, 48 (4), 2097-2098. DOI: 10.1021/es5002105
o Annotation is the key to communicating information
Identification confidence (example; minimum data requirements):
Level 1: Confirmed structure by reference standard (MS, MS2, RT, Reference Std.)
Level 2: Probable structure
a) by library spectrum match (MS, MS2, Library MS2)
b) by diagnostic evidence (MS, MS2, Exp. data)
Level 3: Tentative candidate(s): structure, substituent, class (MS, MS2, Exp. data)
Level 4: Unequivocal molecular formula, e.g. C6H5N3O4 (MS isotope/adduct)
Level 5: Exact mass of interest, e.g. 192.0757 (MS)
[Figure: example structures for Levels 1 to 3 are shown as drawings on the slide]
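The level scheme above can be read as a simple decision rule: the strongest type of evidence available determines the level. The following is a minimal Python sketch of that reading, not code from the cited paper; the `Evidence` class, its field names and the `confidence_level` function are hypothetical and only illustrate the ordering of the levels.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Evidence available for one detected feature (field names are illustrative)."""
    exact_mass: bool = False            # MS: exact mass of interest measured
    unequivocal_formula: bool = False   # MS isotope/adduct data fix one formula
    candidate_structures: bool = False  # MS, MS2, exp. data suggest candidate(s)
    diagnostic_evidence: bool = False   # MS, MS2, exp. data point to one structure
    library_match: bool = False         # MS2 matches a library spectrum
    reference_standard: bool = False    # MS, MS2 and RT match a reference standard


def confidence_level(ev: Evidence) -> str:
    """Return the highest-confidence level supported by the evidence."""
    if ev.reference_standard:
        return "Level 1"   # confirmed structure
    if ev.library_match:
        return "Level 2a"  # probable structure by library spectrum match
    if ev.diagnostic_evidence:
        return "Level 2b"  # probable structure by diagnostic evidence
    if ev.candidate_structures:
        return "Level 3"   # tentative candidate(s)
    if ev.unequivocal_formula:
        return "Level 4"   # unequivocal molecular formula
    if ev.exact_mass:
        return "Level 5"   # exact mass of interest only
    return "unassigned"


if __name__ == "__main__":
    feature = Evidence(exact_mass=True, unequivocal_formula=True, library_match=True)
    print(confidence_level(feature))
```

In practice the assignment involves expert judgement; the sketch only captures the ordering of evidence types listed on the slide.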
14. 14
Creating High-Quality Mass Spectra
Stravs, Schymanski, Singer and Hollender, 2013, Journal of Mass Spectrometry, 48, 89–99. DOI: 10.1002/jms.3131
Automatic MS and MS/MS
Recalibration and Clean-up
Remove interfering peaks
Spectral Annotation with
- Experimental Details
- Compound Information
https://github.com/MassBank/RMassBank/
http://bioconductor.org/packages/RMassBank/
15. 15
Communicating Mass Spectra for Mixtures
Stravs et al. (2013), J. Mass Spectrom, 48(1):89-99. DOI: 10.1002/jms.3131
[Structure: SPA-9C, m+n=6]
Formulas: http://sourceforge.net/projects/genform/
Meringer et al, 2011, MATCH 65, 259-290
Data: Schymanski et al. 2014, ES&T, 48: 1811-1818. DOI: 10.1021/es4044374
Chromatography and MS/MS Annotation
Literature: LIT00034,35
Sample: ETS00002
Standard: ETS00016,17,19,20
https://github.com/MassBank/RMassBank/
16. 16
[Figure: scale of chemical numbers: 1, 10, 100, 1000, 10000, 100000, 1 million, 1 billion]
Our (Community) Challenge: Identifying Chemicals
Data: Schymanski et al 2014, DOI: 10.1021/es4044374; https://www.slideshare.net/EmmaSchymanski/small-molecules-in-big-data-analytica-munich
Sample
High resolution mass spectrometry
Chemicals
AND connecting chemical knowledge
17. 17
European (World-)Wide Exchange of Suspects
Schymanski et al. 2015, ABC, DOI: 10.1007/s00216-015-8681-7
NORMAN Suspect List Exchange:
http://www.norman-network.com/?q=node/236
18. 18
NORMAN Suspect List Exchange
o http://www.norman-network.com/?q=node/236
Schymanski, Aalizadeh et al. in prep; https://www.researchgate.net/project/Supporting-Mass-Spectrometry-Through-Cheminformatics
References; Full Lists
19. 19
o Now 21 lists available online … from small to large!
• Specialist collections (e.g. NormaNEWS) to large market lists
• Integrated into the CompTox Chemicals Dashboard
NORMAN Suspect Exchange Lists
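To illustrate how such suspect lists are typically used, here is a minimal Python sketch of suspect screening by exact mass: each measured m/z is compared with the theoretical [M+H]+ of every suspect within a ppm tolerance. This is an illustration under stated assumptions, not the NORMAN or Dashboard implementation. The function names, the 5 ppm default and the two example entries (caffeine, carbamazepine) are illustrative choices and are not taken from any specific NORMAN list; only the [M+H]+ adduct is considered.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def ppm_error(measured_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def screen_suspects(feature_mzs, suspects, tol_ppm=5.0):
    """Match measured m/z values against [M+H]+ of suspect monoisotopic masses.

    suspects maps a name to a neutral monoisotopic mass (Da).
    Returns a list of (measured m/z, suspect name, ppm error) tuples.
    """
    hits = []
    for mz in feature_mzs:
        for name, mono_mass in suspects.items():
            theoretical = mono_mass + PROTON_MASS  # [M+H]+ only (assumption)
            err = ppm_error(mz, theoretical)
            if abs(err) <= tol_ppm:
                hits.append((mz, name, round(err, 2)))
    return hits


if __name__ == "__main__":
    # Hypothetical mini suspect list: neutral monoisotopic masses in Da.
    suspects = {"caffeine": 194.080376, "carbamazepine": 236.094963}
    features = [195.0877, 237.1022, 301.1410]  # made-up measured m/z values
    for hit in screen_suspects(features, suspects):
        print(hit)
```

A mass match alone corresponds only to a low confidence level; further evidence (isotope pattern, MS/MS, retention time) is needed before a suspect hit counts as an identification.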
20. 20
NORMAN Lists => CompTox Dashboard
https://comptox.epa.gov/dashboard/chemical_lists/normanews
http://www.norman-network.com/?q=node/236
21. 21
Lists on CompTox Chemicals Dashboard
https://comptox.epa.gov/dashboard/chemical_lists/
More lists become available with every release
33. 33
MS-ready: McEachran et al. 2018, J Cheminform. DOI: 10.1186/s13321-018-0299-2
Connecting Resources in MetFrag
34. 34
Connecting and Enhancing Open Resources
https://www.slideshare.net/EmmaSchymanski/small-molecules-in-big-data-analytica-munich
o Sharing knowledge is a win-win situation
2014 2015: found in waters across Europe
2016: 1 datapoint cross-annotates 3072 in GNPS
Hits in GNPS MassIVE datasets:
Surfactants: http://goo.gl/7sY9Pf
2017: Early-Warning System is born
2018: Highlighted in Science
35. 35
NORMAN Digital Sample Freezing Platform
“Live” retrospective screening of known and unknown chemicals in European samples (various matrices)
http://norman-data.eu/ AND Alygizakis et al, in prep.
36. 36
Interactive heatmap available at http://norman-data.eu/NORMAN-REACH
NORMAN Digital Sample Freezing Platform
Retrospective screening of REACH chemicals in Black Sea samples (various matrices)
37. 37
NORMAN Digital Sample Freezing Platform
“Live” retrospective screening of known and unknown chemicals in European samples (various matrices)
Future work: use results of unknowns to drive prioritization efforts
http://norman-data.eu/ AND Alygizakis et al, in prep.
38. 38
Real-time Monitoring of the Rhine River
Hollender, Schymanski, Singer & Ferguson, 2017, ES&T Feature, 51:20, 11505-11512. DOI: 10.1021/acs.est.7b02184
Previously unknown chemicals detected due to “stand-out” patterns
40. 40
We still have many unknowns …
(l) Data from Schymanski et al 2014, ES&T DOI: 10.1021/es4044374. (r) E. coli data provided by N. Zamboni, IMSB, ETH Zürich.
Environment
Cells
43. 43
Homologous Series Detection
M. Loos & H Singer, 2017. J. Cheminf. DOI: 10.1186/s13321-017-0197-z & Schymanski et al. 2014, ES&T DOI: 10.1021/es4044374
http://www.envihomolog.eawag.ch/
Search for discrete mass differences
[Figure: example structures with repeating units m and n, including a C9H19 substituent]
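The idea of searching for discrete mass differences can be sketched in a few lines of Python. This is a simplified illustration, not the enviHomolog algorithm: it chains masses that differ by one repeating unit within an absolute tolerance and ignores retention time, which real tools also use. The function name, the tolerance, the minimum series length and the example masses are all assumptions made for this sketch.

```python
# Monoisotopic masses of common repeating units (Da)
CH2 = 14.015650
CF2 = 49.996806
C2H4O = 44.026215


def find_homologous_series(masses, unit_mass, tol=0.002, min_length=3):
    """Chain masses that differ by unit_mass (within tol) into series.

    Returns a list of series, each a list of masses in ascending order.
    """
    ordered = sorted(masses)
    used = set()
    series_list = []
    for start in ordered:
        if start in used:
            continue
        series = [start]
        while True:
            target = series[-1] + unit_mass
            candidates = [m for m in ordered
                          if abs(m - target) <= tol and m not in used]
            if not candidates:
                break
            # Take the candidate closest to the expected next mass
            series.append(min(candidates, key=lambda m: abs(m - target)))
        if len(series) >= min_length:
            series_list.append(series)
            used.update(series)
    return series_list


if __name__ == "__main__":
    # Made-up masses: four members spaced by CH2 plus one unrelated mass
    masses = [300.0000, 314.0157, 328.0313, 342.0470, 500.1234]
    print(find_homologous_series(masses, CH2))
```

Swapping `CH2` for `CF2` would search for the +CF2 spacing that appears in the biological example later in the talk.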
45. 45
Cross-Linking Homologues in the Dashboard
Schymanski, Grulke, Williams et al, in prep. & Williams et al. 2017 J. Cheminformatics 9:61 DOI: 10.1186/s13321-017-0247-6
https://comptox.epa.gov/dashboard/chemical_lists/eawagsurf
46. 46
Homologous Series in Biological Matrices
Lipid extract of Mycobacterium smegmatis
C23F48O7
+CF2
Schymanski & Zamboni … random data exploration …
47. 47
Exchanging Knowledge … Open Science Helps!
We need to be able to find and annotate the unexpected!
C23F48O7
+CF2
Schymanski & Zamboni … random data exploration …
48. 48
Exchanging Knowledge … Open Science Helps!
We need to be able to find and annotate the unexpected!
49. 49
Exchanging data reveals things we never expected!
Schymanski & Zamboni … random data exploration …
o Lipid extract of Mycobacterium smegmatis
C23F48O7
+CF2
DTXSID70880513
50. 50
Community Challenges … and Solutions
Data: Schymanski et al 2014, DOI: 10.1021/es4044374; https://www.slideshare.net/EmmaSchymanski/small-molecules-in-big-data-analytica-munich
High resolution mass spectrometry
AND connecting chemical knowledge
51. 51
Sampling, extraction (SPE) → HPLC separation → HR-MS/MS
Conversion (Proteowizard) and Peak Picking (enviPick, xcms, MZmine, …)
Detection of blank/blind/noise/internal standards; time trend analysis (enviMass)
o TARGET ANALYSIS
• Target List (enviMass, vendor software)
o SUSPECT SCREENING
• Suspect List (e.g. NORMAN, LMC, Eawag-PPS, ReSOLUTION)
• Gather evidence (nontarget, ReSOLUTION, RMassBank)
o NON-TARGET SCREENING
• Componentization (nontarget)
• Prioritization (enviMass)
• Masses of interest
• Molecular formula determination (enviPat, GenForm)
• MS/MS Extraction (RMassBank)
• Non-target identification (MetFrag2.3, ReSOLUTION)
Interpretation, confirmation, peak inventory, confidence and reporting
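The molecular formula determination step in this workflow rests on comparing a measured mass with masses calculated from candidate formulas. The sketch below shows only that core calculation in Python; it is not how enviPat or GenForm work internally, since those tools also use isotope patterns and MS/MS information. The function names and the small element table are assumptions made for this example, and the parser handles only simple formulas without brackets or charges.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of selected elements
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462,
    "F": 18.99840316, "P": 30.97376200, "S": 31.97207117, "Cl": 34.96885268,
}


def monoisotopic_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C6H5N3O4'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means one atom; unknown elements raise KeyError
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def rank_formulas(neutral_mass, candidates):
    """Sort candidate formulas by absolute mass error in ppm (smallest first)."""
    ranked = []
    for formula in candidates:
        theoretical = monoisotopic_mass(formula)
        error_ppm = (neutral_mass - theoretical) / theoretical * 1e6
        ranked.append((formula, round(theoretical, 4), round(error_ppm, 1)))
    return sorted(ranked, key=lambda item: abs(item[2]))


if __name__ == "__main__":
    print(round(monoisotopic_mass("C6H5N3O4"), 4))
    # Made-up neutral mass and candidate list, for illustration only
    print(rank_formulas(183.0282, ["C6H5N3O4", "C7H9N2O2S"]))
```

Mass accuracy alone rarely yields a single formula at higher masses, which is why isotope and adduct information is listed as the minimum data requirement for an unequivocal formula.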
52. 52
Coming Soon … (WiP and Already Online!)
Schymanski, Baker, Williams, Singh et al. in preparation. Excel macro: https://figshare.com/s/824f6606644f474c7288
https://comptox.epa.gov/dashboard/chemical_lists/litminedneuro
53. 53
Conclusions / Outlook / Perspectives
Monzel et al 2017 Stem Cell Reports, DOI: 10.1016/j.stemcr.2017.03.010 (Organoids)
o Over 60 % of HR-MS peaks are potentially relevant but unknown
o Non-target screening requires data and evidence from many different sources
o Many excellent workflows now available to collate this information
o Incorporation of all available metadata (expert knowledge) is critical to success!
o Complex mixtures (UVCBs) are a huge and very challenging part of the puzzle
o New cheminformatics approaches needed - great progress so far
o Information in the public domain helps everyone!
o Additional experimental methods can provide more information
o H-D exchange-based labelling [EXTRA SLIDES]
o Integration of computational toxicity knowledge essential
o LCSB has some amazing facilities and expertise
(I am just beginning to appreciate how much …)