Chemicals, Chemical Identifiers and
Navigating Through Databases
Antony Williams
UNC Chapel Hill, October 2010
Chemistry on the Internet
• Where do you source chemistry information?
• What can you trust online?
• How can you recognize potential issues?
• Cross-referencing and curating data
What is the Structure of Vitamin K?
MeSH
• A lipid cofactor that is required for normal blood
clotting. Several forms of vitamin K have been
identified: VITAMIN K 1 (phytomenadione) derived
from plants, VITAMIN K 2 (menaquinone) from
bacteria, and synthetic naphthoquinone provitamins,
VITAMIN K 3 (menadione). Vitamin K 3 provitamins,
after being alkylated in vivo, exhibit the
antifibrinolytic activity of vitamin K. Green leafy
vegetables, liver, cheese, butter, and egg yolk are
good sources of vitamin K.
What is the Structure of Vitamin K1?
Wikipedia
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
PubChem
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
• Variants of systematic names on PubChem
  • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
  • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
  • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
  • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
  • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
  • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  • 2-methyl-3-(3,7,11,15-tetramethyl
  • 2-methyl-3-[(E)-3,7,11,15-tetramethyl
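The name variants above differ only in how much stereochemistry they declare. A minimal Python sketch of that check follows, assuming (this is not stated on the slide) that the side chain has two stereocentres (C7, C11) and one stereogenic double bond, so a fully defined name needs three descriptors. The function names are illustrative, and the name fragments are used exactly as truncated on the slide.

```python
import re

# Name fragments as listed on the slide (truncated there; the repeated
# "(E)" entry is included once).
names = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
]

# Assumption: two stereocentres and one stereogenic double bond.
EXPECTED_CENTRES = 2
EXPECTED_DOUBLE_BONDS = 1


def stereo_descriptors(name):
    """Return (centre descriptors, double-bond descriptors) found in a name."""
    centres, double_bonds = [], []
    # A descriptor group looks like "(E,7R,11R)": R/S/E/Z with optional locants.
    for group in re.findall(r"\((\d*[RSEZ](?:,\d*[RSEZ])*)\)", name):
        for token in group.split(","):
            if token[-1] in "RS":
                centres.append(token)
            else:
                double_bonds.append(token)
    return centres, double_bonds


def is_fully_defined(name):
    """True when the name declares every expected stereo element."""
    centres, double_bonds = stereo_descriptors(name)
    return (len(centres) == EXPECTED_CENTRES
            and len(double_bonds) == EXPECTED_DOUBLE_BONDS)


for name in names:
    print(is_fully_defined(name), stereo_descriptors(name), name)
```

Under that assumption only the first four variants count as fully defined; the others leave one or more stereo elements unspecified.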
Bioassay Data are Associated…
Lack of Stereochemistry
ChEBI – Manual Curation
Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)
Molfiles
 10 9 0 0 1 0 0 0 0 0 1 V2000
 31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
 30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
 28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
 26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
 32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
 30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
 3 1 2 0 0 0 0
 4 1 1 0 0 0 0
 9 1 1 0 0 0 0
 7 2 1 0 0 0 0
 5 2 2 0 0 0 0
 8 2 1 0 0 0 0
 6 4 1 0 0 0 0
 4 10 1 6 0 0 0
 7 6 1 0 0 0 0
 M END
Molfiles
• Molfiles are the primary exchange format between
structure drawing packages
• Can be different between different drawing packages
• Most commonly carry X,Y coordinates for layout
• Can support polymers, organometallics, etc.
• Can carry 3D coordinates
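To make the layout of the connection table shown earlier concrete, here is a minimal Python sketch that reads its counts line, atom block and bond block. It assumes whitespace-separated fields, which holds for this small example; real V2000 files are fixed-width and a robust reader should slice by column instead. The three header lines of a full molfile are not on the slide and are not included. `parse_ctab` is an illustrative name, not part of any library.

```python
from collections import Counter

# Connection table as shown on the slide (header lines not shown there).
CTAB = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""


def parse_ctab(text):
    """Return (atoms, bonds) from a whitespace-separated V2000 table.

    atoms: list of (element, x, y, z)
    bonds: list of (atom1, atom2, bond_type, stereo_flag), 1-based atom indices
    """
    rows = [line.split() for line in text.strip().splitlines()]
    n_atoms, n_bonds = int(rows[0][0]), int(rows[0][1])
    atom_rows = rows[1:1 + n_atoms]
    bond_rows = rows[1 + n_atoms:1 + n_atoms + n_bonds]
    atoms = [(r[3], float(r[0]), float(r[1]), float(r[2])) for r in atom_rows]
    bonds = [(int(r[0]), int(r[1]), int(r[2]), int(r[3])) for r in bond_rows]
    return atoms, bonds


atoms, bonds = parse_ctab(CTAB)
print(Counter(element for element, *_ in atoms))
# Bonds with a non-zero stereo flag carry the drawn wedge/hash information.
print([bond for bond in bonds if bond[3] != 0])
```

The only stereo information in this table is the flag on the bond between atoms 4 and 10; if a drawing package drops that flag on export, the stereocentre is silently lost.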
SMILES (http://en.wikipedia.org/wiki/SMILES)
• SMILES is a common format
• Can support polymers, organometallics, etc.
• Does NOT carry X, Y or Z coordinates for layout, so
requires layout algorithms – can be problematic!
• Generally different between drawing packages
Stereo
Tautomers
SMILES
• ACD/Labs
  CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O
• OpenEye
  CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
• ChEMBL
  CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C
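The three strings above cannot be matched by plain text comparison. As a rough illustration, the Python sketch below counts heavy atoms and stereo markers in each string using only the standard library. This is a crude sanity check under the assumption that the SMILES use the common organic-subset atoms; it is not canonicalization, which needs a cheminformatics toolkit. The helper names are illustrative.

```python
import re
from collections import Counter

smiles = {
    "ACD/Labs": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C",
}

# Bracket atoms, two-letter halogens, then single-letter organic-subset atoms.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atoms(s):
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for token in ATOM.findall(s):
        if token.startswith("["):
            # e.g. "[C@@H]" -> "C": skip any isotope digits, take the symbol
            symbol = re.match(r"\[\d*([A-Z][a-z]?|[a-z])", token).group(1)
        else:
            symbol = token
        symbol = symbol.capitalize()  # aromatic "c" -> "C"
        if symbol != "H":
            counts[symbol] += 1
    return counts


def stereo_marks(s):
    """Return (number of tetrahedral centres marked, double-bond geometry given?)."""
    centres = len(re.findall(r"\[[^\]]*@[^\]]*\]", s))
    has_geometry = "/" in s or "\\" in s
    return centres, has_geometry


for source, s in smiles.items():
    print(source, dict(heavy_atoms(s)), stereo_marks(s))
```

All three strings have the same heavy-atom composition and each marks two stereocentres, but only the OpenEye string specifies the double-bond geometry, so the strings do not all describe the structure at the same level of detail.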
The InChI Identifier
InChI
• SINGLE code base managed by IUPAC –
integrated into drawing packages. No variability
as with SMILES
• InChI Strings can be reversed to structures –
same problem as with SMILES – no layout
• Well adopted by the community (databases,
publishers, blogs, Wikipedia) – good for searching
the internet
Multiple Layers
Tautomers – “Mobile H Perception”
Double Bond Orientation
Stereo
Checking for Stereochemistry
Use your drawing package!
InChI Strings Hash to InChIKeys
PubChem InChIKeys
• MBWXNTAXLNYFJB-NKFFZRIASA-N
• MBWXNTAXLNYFJB-LKUDQCMESA-N
• MBWXNTAXLNYFJB-UHFFFAOYSA-N
• MBWXNTAXLNYFJB-FAKCLFGASA-N
• MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
• MBWXNTAXLNYFJB-ODDKJFTJSA-N
• MBWXNTAXLNYFJB-KSVLJPARSA-N
• MBWXNTAXLNYFJB-UDCSOKOMSA-N
• MBWXNTAXLNYFJB-JHBCSKSVSA-N
• MBWXNTAXLNYFJB-JXAKDHTRSA-N
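The keys listed above all share their first block. A short Python sketch of how a standard InChIKey can be split into its parts follows: a 14-character connectivity hash, an 8-character hash of the remaining layers (stereo, isotopes), a standard/non-standard flag, a version character and a protonation character. The `KeyParts` and `split_inchikey` names are illustrative, not from any library.

```python
from collections import defaultdict, namedtuple

KeyParts = namedtuple("KeyParts", "skeleton layers standard version protonation")

keys = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",  # O-18 label
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]

# Second-block hash produced when no stereo or isotope layers are present.
NO_EXTRA_LAYERS = "UHFFFAOY"


def split_inchikey(key):
    """Split an InChIKey into its blocks, checking the expected lengths."""
    first, second, third = key.split("-")
    if (len(first), len(second), len(third)) != (14, 10, 1):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return KeyParts(
        skeleton=first,             # hash of the connectivity layers
        layers=second[:8],          # hash of stereo/isotope layers
        standard=second[8] == "S",  # "S" standard, "N" non-standard
        version=second[9],
        protonation=third,
    )


by_skeleton = defaultdict(list)
for key in keys:
    by_skeleton[split_inchikey(key).skeleton].append(key)

print({skeleton: len(group) for skeleton, group in by_skeleton.items()})
print([k for k in keys if split_inchikey(k).layers == NO_EXTRA_LAYERS])
```

Grouping on the first block alone gathers every stereo and isotope variant of one skeleton, while matching the full key returns a single fully specified form.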
Databases and Standardization
InChI
• No support for polymers, organometallics
• Many option settings can lead to variability and
make integration across databases difficult –
FixedH option especially problematic
• “Slight” chance of collisions of InChIKeys
• VERY USEFUL FOR INTEGRATING THE WEB
Vancomycin
Search Molecular Skeleton
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
Where is chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
Linked Data on the Web
Taken from: Rafael Sidis’ Blog
www.chemspider.com
Search for a Chemical…by name
Available Information…
• Linked to vendors, safety data, toxicity, metabolism
How do we build it?
• 25 million chemicals from 400 data sources
• We deal in Molfiles or SDF files – including
coordinates
• We do rudimentary filtering – valence checking,
charge imbalance – prior to deposition
• We have our own “business logic” to standardize
• We use InChI to “aggregate tautomers” to one
record
• We link out to external sites where possible using
their IDs
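The build steps above can be sketched roughly in Python as a filter followed by aggregation on a structure identifier. Everything in this sketch is hypothetical: the record layout, the source names and IDs, the placeholder key on the last record, and the reading of "charge imbalance" as a non-zero sum of formal charges. Valence checking and the standardization "business logic" are not modelled, and this is not ChemSpider's actual code.

```python
from collections import defaultdict

# Hypothetical deposited records; keys reuse InChIKeys shown earlier,
# except the last one, which is a placeholder.
deposits = [
    {"source": "SourceA", "id": "A-001",
     "key": "MBWXNTAXLNYFJB-NKFFZRIASA-N", "charges": [0, 0, 0]},
    {"source": "SourceB", "id": "B-17",
     "key": "MBWXNTAXLNYFJB-NKFFZRIASA-N", "charges": [0, 0, 0]},
    {"source": "SourceC", "id": "C-9",
     "key": "MBWXNTAXLNYFJB-UHFFFAOYSA-N", "charges": [0, 0]},
    {"source": "SourceC", "id": "C-10",
     "key": "XXXXXXXXXXXXXX-UHFFFAOYSA-N", "charges": [1, 0]},
]


def is_charge_balanced(record):
    """Assumed filter: formal charges on the atoms must sum to zero."""
    return sum(record["charges"]) == 0


def aggregate(records):
    """Group accepted records by key, keeping (source, id) pairs for link-outs."""
    merged = defaultdict(list)
    rejected = []
    for record in records:
        if not is_charge_balanced(record):
            rejected.append(record)
            continue
        merged[record["key"]].append((record["source"], record["id"]))
    return dict(merged), rejected


merged, rejected = aggregate(deposits)
for key, links in merged.items():
    print(key, links)
print("rejected:", [(r["source"], r["id"]) for r in rejected])
```

Keeping the original source IDs on each merged record is what makes the link-outs possible, and it also means an error in one source can be traced back to where it came from.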
Inherited Errors
• We have inherited errors from every database…
all public compound databases, including ours,
have errors
• “Incorrect” structures – assertions, timelines, etc.
• “Incorrect” names associated with structures
• Properties
• Links
• Publications
• ENORMOUS CHALLENGE
Compounds and Identifiers
Be careful searching by Name!
• Determining the correct structure by name
searching is difficult online!
• Good, but not perfect:
  • Wikipedia
  • ChEBI/ChEMBL
  • ChemIDPlus
  • ChemSpider
• Be VERY careful with MOST databases
Validating structures
• Check for “full stereo” and use stereo descriptors
especially for checking!
• Check for quality of associated data sources
• Check against reference literature when available
– but it can be wrong
• Question EVERYTHING!
Online Curation
• Online databases generally do NOT allow
curation or annotation
• If you find errors they stay there!
• ChemSpider is unique…immediate curation
• ChemSpider live demo following this lecture
  • Searching
  • Deposition and Curation
  • ChemSpider SyntheticPages
Thank you
Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
