Importance of data standards for
large scale data integration in
chemistry
Antony Williams, Valery Tkachenko, Alexey
Pshenichnov, Ken Karapetyan, Stuart Chalk,
Daniel Lowe and Carlos Cobas
ACS Denver, March 2015
Free and Easy
• To make it easy to “take notes” these slides
will be available at:
www.slideshare.net/AntonyWilliams/
Charles Holland Duell
• 1898-1901: US Commissioner
of Patents
• Attributed to him, probably apocryphally:
"Everything that can be invented has been invented."
Antony John Williams (et al)
• “We don’t need more
standards!”
• “Of COURSE we can build
a spectral database!”
• “The standards we have
are good enough”
A Pragmatic View to Progress
• Let’s consider progressing an NMR Spectral
database for the community!
• MUST HAVES – spectra (1D/2D), associated
structures, assignments
• WANTS – predict NMR spectra, spectral
searching, privacy/embargoes
• What would we need in terms of standards?
• Molfiles and JCAMP
Standards without adoption..
Standards
2D NMR
Progress in standards
Standards without adoption
are limited in value
• If the instrument vendors don’t support or
adopt the standards, success is limited
• YESTERDAY: discussion about publishing
NMR – JCAMP
• But what is already available will work – JEOL,
Bruker, Thermo, Anasazi, Agilent/Varian –
imperfect but useful
www.ChemSpider.com
9400 Spectra and growing
http://www.chemspider.com/spectra.aspx
JCAMP NMR Spectra
Data on ChemSpider
JCAMP file downloads
• When NMR spectra are stored as JCAMP
then downloads into offline packages are
feasible – Mestrelab, ACD/Labs, etc.
• Open Data – download versus view
• Store spectra locally and reuse
• Java is increasingly a pain!
• Need to move to HTML5 viewing on
ChemSpider, especially for Mobile Viewing
Challenges with Spectra
• JCAMP is good for a lot of spectral data – IR,
Raman, 1D NMR
• MS data is rarely made available in JCAMP
• We would love a ratified JCAMP 6.0 for 2D
data exchange – allows third parties to build
support for download
• ASSIGNED JCAMP spectra supported
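To make the JCAMP bullets above concrete, here is a minimal sketch, assuming a JCAMP-DX file is plain text built from labelled data records of the form `##LABEL=value`. The function names and the sample record are illustrative and not taken from the talk. The sample is a fragment rather than a complete, valid file, and the sketch only collects header fields; it does not decode compressed data tables.

```python
def normalise_label(label):
    # Assumption: labels are compared ignoring case, spaces, '-', '/' and '_'.
    return "".join(ch for ch in label.upper() if ch not in " -/_")


def read_jcamp_headers(text):
    """Collect ##LABEL=value records from JCAMP-DX text into a dict."""
    headers = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].rstrip()  # "$$" starts an inline comment
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            current = normalise_label(label)
            if current == "END":
                break
            headers[current] = value.strip()
        elif current is not None and line.strip():
            # Continuation line, for example a row of a data table.
            headers[current] += "\n" + line.strip()
    return headers


# Illustrative fragment only (hypothetical values).
SAMPLE = "\n".join([
    "##TITLE=ethanol 1H NMR (illustrative)",
    "##JCAMP-DX=5.01",
    "##DATA TYPE=NMR SPECTRUM",
    "##.OBSERVE FREQUENCY=400.0",
    "##.OBSERVE NUCLEUS=^1H",
    "##XYDATA=(X++(Y..Y))",
    "1600 10 250 12",
    "##END=",
])

headers = read_jcamp_headers(SAMPLE)
print(headers["TITLE"], headers[".OBSERVENUCLEUS"])
```

Because the header block is plain text, a third party can index title, nucleus and frequency without any vendor software, which is the practical appeal of the format.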
Proper Verification
03/25/15
Advanced Chemistry Development, Inc.
(ACD/Labs)
Jmol - JSpecView
ChemDoodle Components
Spectral Display in the hand
New Repository Architecture
doi: 10.1007/s10822-014-9784-5
Compounds
Reactions
Analytical data
Deposition of Data
1,000,000 Spectra Online?
ESI – Text Spectra
Developing Proof-of-Concept
• Extract from 1976-2014 USPTO applications
Nucleus    Count
H          975543
C          56536
unknown*   44306
F          9429
P          3241
B          91
Si         62
Sn         22
Se         11
N          8
* unknown – the text starts off with "NMR:" and a peak list (no nucleus stated)
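As a quick check on the scale of this extraction, the counts above can be totalled and turned into shares. This is plain arithmetic on the numbers from the slide; the variable names are illustrative.

```python
# Counts of text spectra by nucleus, copied from the slide.
counts = {
    "H": 975543, "C": 56536, "unknown": 44306, "F": 9429, "P": 3241,
    "B": 91, "Si": 62, "Sn": 22, "Se": 11, "N": 8,
}

total = sum(counts.values())
shares = {nucleus: count / total for nucleus, count in counts.items()}

print(f"total: {total}")
for nucleus, count in counts.items():
    print(f"{nucleus:8s}{count:>8d}{100 * shares[nucleus]:8.2f}%")
```

The total comes to 1,089,249 text spectra, of which roughly 90% are 1H.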
We want to find text spectra?
• We can find and index text spectra: 13C NMR
(CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH,
benzylic methane), 30.77 (CH, benzylic
methane), 66.12 (CH2), 68.49 (CH2), 117.72,
118.19, 120.29, 122.67, 123.37, 125.69, 125.84,
129.03, 130.00, 130.53 (ArCH), 99.42, 123.60,
134.69, 139.23, 147.21, 147.61, 149.41,
152.62, 154.88 (ArC)
• What would be better are spectral figures – and
include assignments where possible!
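A minimal sketch of how a text spectrum like the one above could be indexed, assuming a simple regular-expression approach. This is not the authors' extraction pipeline, and the function name is hypothetical. It handles only un-nested annotations such as "(CH3)"; a 1H listing with nested parentheses and J couplings would need a real parser.

```python
import re

# Header such as "13C NMR (CDCl3, 100 MHz)".
HEADER = re.compile(
    r"(\d+)([A-Z][a-z]?)\s+NMR\s*\(([^,()]+),\s*(\d+(?:\.\d+)?)\s*MHz\)"
)
# A shift with an optional, un-nested annotation: "14.12 (CH3)" or "117.72".
PEAK = re.compile(r"(\d+\.\d+)(?:\s*\(([^()]*)\))?")


def parse_text_spectrum(text):
    match = HEADER.search(text)
    if match is None:
        raise ValueError("no NMR header found")
    mass, element, solvent, mhz = match.groups()
    # An annotation is attached only to the shift it directly follows,
    # even when it logically covers a whole run of shifts.
    peaks = [(float(shift), note or None)
             for shift, note in PEAK.findall(text[match.end():])]
    return {
        "nucleus": f"{mass}{element}",
        "solvent": solvent.strip(),
        "frequency_mhz": float(mhz),
        "peaks": peaks,
    }


EXAMPLE = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, "
    "123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)

spectrum = parse_text_spectrum(EXAMPLE)
print(spectrum["nucleus"], spectrum["solvent"], len(spectrum["peaks"]))
```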
Mestrelab Mnova NMR
1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t,
1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz,
C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane),
30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19,
120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42,
123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
ESI Data also contains figures
Publications & “Real Spectra”
• We are turning text into spectra
• We are turning figures into spectra
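One way to picture "turning text into spectra" is to place a line shape at each reported shift. The sketch below assumes unit-height Lorentzian lines with an arbitrary half-width; it is an illustration of the idea, not the method used by the authors, and the function names and parameters are hypothetical.

```python
def lorentzian(x, x0, hwhm):
    """Unit-height Lorentzian centred at x0; hwhm is the half-width at half-maximum."""
    return hwhm ** 2 / ((x - x0) ** 2 + hwhm ** 2)


def synthesize_spectrum(shifts_ppm, start, stop, n_points, hwhm=0.05):
    """Sum one Lorentzian per shift on an evenly spaced ppm axis."""
    step = (stop - start) / (n_points - 1)
    xs = [start + i * step for i in range(n_points)]
    ys = [sum(lorentzian(x, shift, hwhm) for shift in shifts_ppm) for x in xs]
    return xs, ys


# Three of the 13C shifts quoted in the text spectrum.
shifts = [14.12, 66.12, 68.49]
xs, ys = synthesize_spectrum(shifts, 0.0, 100.0, 10001)  # 0.01 ppm grid
print(len(xs), max(ys))
```

A real reconstruction would also need intensities, multiplicities and line widths, which a bare list of shifts does not carry; that gap is one reason the original digital data is preferable.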
Early Test Experiments

Input
• 74 supplementary data documents, 3444 pages
Output
• Plot2Txt extracted content from 1069 pages
• 1151 spectra total – >80% of peaks extracted to
within 1-2 decimal places (ppm)
“Where is the real data please?”
FIGURE
DATA
Manual Curation Layer
• ALL SPECTRA WILL BE STORED AS JCAMP
• ChemSpider has had a manual curation layer
for >8 years
• Users can annotate data on ChemSpider
• We do receive useful feedback from the
community on the data and are optimistic!
Extraction is the WRONG WAY
• We should NOT have to mine data out – it should arrive in digital form!
• Structures should be submitted “correctly”
• Spectra should be in digital spectral formats,
not images
• ESI should be RICH and interactive
• Data should be open, available, with metadata
and provenance
We can solve for Authors here
Will it be used though??? YES!
Supplementary Info Data now..
Data mining – it’s MINE!!!
But there ARE solutions!
What should we be doing?
• Settle on a short-term format – JCAMP-JMOL?
• Convince the instrument vendors to export in
this format
• Push-button depositions into “containers” –
ChemSpider, NMRShiftDB, Institutional
Repositories
• Encourage format support in software (read
and write) – Mestre, ACD/Labs, Bruker
TopSpin, etc.
NMRShiftDB anyone?
Standards in Large Scale
Data Integration
• ALL of these are imperfect standards
• Molfiles
• SDF
• InChI
• JCAMP
• But what can be done with them?
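As one small answer to that question, here is a sketch of reading an SD file with nothing but the standard library, assuming the usual layout: a molfile block ending in `M  END`, data items introduced by `> <FIELD>` lines, and `$$$$` closing each record. The function names and the two-record sample are illustrative; real files have more variants (for example data headers without a field name) that this sketch does not handle.

```python
def parse_record(lines):
    """Split one SD record into its molfile block and a dict of data items."""
    # Assumes the record contains an "M  END" line.
    end = next(i for i, line in enumerate(lines) if line.startswith("M  END"))
    molblock = "\n".join(lines[: end + 1])
    props, key = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Data header such as "> <NAME>"; assumes a <FIELD> name is present.
            key = line[line.index("<") + 1: line.rindex(">")]
            props[key] = []
        elif line.strip() and key is not None:
            props[key].append(line)
        else:
            key = None  # a blank line ends the current data item
    return molblock, {k: "\n".join(v) for k, v in props.items()}


def read_sdf_records(text):
    """Return a list of (molblock, properties) tuples, one per '$$$$' record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records


# Illustrative two-record SD file (single-atom structures, hypothetical header lines).
SAMPLE_SDF = "\n".join([
    "methane", "  sketch", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <NAME>", "methane", "",
    "> <INCHI>", "InChI=1S/CH4/h1H4", "",
    "$$$$",
    "water", "  sketch", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <NAME>", "water", "",
    "> <INCHI>", "InChI=1S/H2O/h1H2", "",
    "$$$$", "",
])

records = read_sdf_records(SAMPLE_SDF)
for molblock, props in records:
    print(props["NAME"], props["INCHI"])
```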
Compound Data
• The standards of chemical structure handling
are primarily molfile, SDfile, SMILES, InChI
• We primarily depend on molfiles and SDF
files for data deposition and interchange
• We use InChI a lot – especially for integrated
searching across the web
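The InChI-based web searching mentioned above typically relies on the InChIKey, a fixed-length hashed form of the InChI that search engines index as an ordinary word. A minimal sketch, assuming the standard 14-10-1 letter layout; the page text and function names are illustrative.

```python
import re

# InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")


def find_inchikeys(text):
    """Return every InChIKey-shaped token found in free text."""
    return INCHIKEY.findall(text)


def connectivity_block(inchikey):
    """First block of the key: a hash of the connectivity (skeleton) layer."""
    return inchikey.split("-")[0]


PAGE = (
    "Aspirin (InChIKey=BSYNRYMUTXBXSQ-UHFFFAOYSA-N) and caffeine "
    "(RYYVLZVUVIJVGH-UHFFFAOYSA-N) are both listed on this page."
)

keys = find_inchikeys(PAGE)
queries = [f'"{key}"' for key in keys]  # quoted strings for an exact-match web search
print(queries)
print(connectivity_block(keys[0]))
```

Searching on the first block alone is a common way to find records that share a skeleton but differ in stereochemistry or isotopic labelling.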
Searching the Entire Web?
Searching Internet by Structure
Compound Data
• There ARE data interchange problems
associated with structures….
USE and TEACH Standards
• Too few people are aware of the existing
standards and their capabilities
• Part of the CINF mission activities should be
to teach standards and this is being done
• Still too few people have heard of InChI and
JCAMP for example
• Still too little is known about the importance of
correct structure representations – kudos to
people like Leah et al. who TEACH THIS!
USE and TEACH Standards!
CVSP: Validate and Standardize
CVSP Rule Sets
CVSP Filtering of DrugBank
Compounds
Reactions
Use Ontologies
Contribute to PUBLIC
Ontologies
• Yes, there are “company” ontologies – but for
the good of the community contribute to
public ontologies and standards
• For data interchange and meshing this is
soooooo beneficial!
ChAMP – Stuart Chalk
Use standards in APIs,
endpoints and widgets
Semanticize content: RDF
Actions
• Support and encourage new standards
• In the meantime, reawaken and modernize the
JCAMP standard
• Show up and listen to Bob Hanson today
• Encourage scientists to provide data
Charles Holland Duell in 1902
“…all previous advances in the
various lines of invention will
appear totally insignificant when
compared with those which the
present century will witness.
I almost wish that I might live my
life over again to see the wonders
which are at the threshold”
“Git-r-Done”
Acknowledgments
• Daniel Lowe – NextMove, Reactions and Spectra
• Bill Brouwer – Plot2Txt Development
• Carlos Cobas and Stan Sykora – Mestrelab
• The ChemSpider team – led by Richard Kidd
• The RSC Data Repository team
Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

