Karen Karapetyan, Colin Batchelor,
Jonathan Steele, David Sharpe,
Valery Tkachenko, Antony Williams
Building support for the semantic web
for chemistry
at the Royal Society of Chemistry
http://www.openphacts.org
Open PHACTS is an Innovative Medicines
Initiative (IMI) project, aiming to reduce the
barriers to drug discovery in industry, academia
and for small businesses.
The semantic web is one of its cornerstones.
RDF Export
Data:
ChEMBL
HMDB
DrugBank
Chemistry Validation and Standardization Platform (CVSP)
at cvsp.chemspider.com
• Validation
• Standardization
• Parent generation
• Run on Hadoop-based farm
CVSP: chemical validation
A free chemistry validation platform that performs:
• Structure validation
• Atoms
• Bonds
• Valence
• Stereo
• If aromatic, check that the structure can be dearomatized uniquely
• Flag cases where the strongest acid is not ionized first in a partially ionized system
• Cross-matching of SDF fields
• Synonyms
• InChIs
• SMILES
Input formats supported:
• CDX, MOL, SDF
• ZIP, GZ
• Tab-delimited text files
CVSP: standardization modules
• Custom processing lets the user assemble a workflow from a list of predefined standardization modules
Data set examples
• ChemSpider (100K records passed so far; all records are planned to pass through CVSP)
• DrugBank (~6.5K records)
• ChEMBL (~1.2 million records)
ChemSpider issues
DrugBank dataset (6516 records)
~60 records that can't be dearomatized unambiguously
DB04283 DB04462
~30 records with bonds that do not make sense
DB04283 DB04009
2 records where the SMILES, InChI, and name did not match the structure
DB00611 DB01547
~40 records where InChIs did not match the structure
DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
DrugBank ID: DB00614
7 records with 2 stereo bonds at chiral atoms
DB08128 DB06287
Reference: J. Brecher, IUPAC Graphical Representation of Stereochemical Configuration, Section ST-1.1.10
CVSP validation of ChEMBL 16 (~1.3 million records)
• Overall 0.7% of records had validation issues
• Stereo problems (~82%)
• Directions of bonds do not make sense (~63%)
• Ambiguous stereo : 2 stereo bonds at chiral center (~19%)
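To give a feel for the scale, the percentages above can be turned into approximate record counts. This is only a back-of-the-envelope sketch: it assumes that the 82%, 63% and 19% figures are shares of the flagged records rather than of the whole dataset, and it uses the rounded total of 1.3 million, so the results are rough estimates, not reported numbers.

```python
# Rounded figures taken from the slide; everything derived here is an estimate.
TOTAL_RECORDS = 1_300_000      # ChEMBL 16, "~1.3 million records"
FLAGGED_FRACTION = 0.007       # "0.7% of records had validation issues"

# Assumption: these shares are fractions of the flagged records.
SHARES = {
    "stereo problems": 0.82,
    "bond direction makes no sense": 0.63,
    "two stereo bonds at a chiral center": 0.19,
}

flagged = round(TOTAL_RECORDS * FLAGGED_FRACTION)
estimates = {name: round(flagged * share) for name, share in SHARES.items()}

print(f"flagged records: about {flagged}")
for name, count in estimates.items():
    print(f"{name}: about {count}")
```

Under these assumptions, roughly nine thousand records are flagged, most of them for stereo problems.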
“Direction of bond makes no sense”: 63%
“Stereo types of the opposite bonds mismatch”: 15%
http://www.iupac.org/publications/pac/2006/pdf/7810x1897.pdf
“Stereo types of non-opposite bonds match”: 2%
“Atom not recognized”: 3% (isotopes)
In the molfile:
• The atom symbol should be an atom from the periodic table
• No mass difference in the atom line
• No “M ISO” in the connection table
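The isotope check can be sketched as follows. This is a minimal illustration, not the CVSP implementation: it assumes a V2000 molfile, reads the atom block by fixed columns, and assumes that the flagged records write an isotope label (for example D) in the atom symbol field. The function name `check_isotope_encoding`, the helper `atom_line`, and both sample records are hypothetical.

```python
# Element symbols of the periodic table (Z = 1 to 118).
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())


def atom_line(symbol, x=0.0, mass_diff=0):
    """Build a V2000 atom line: three 10-column coordinates, a space,
    a 3-column symbol, a 2-column mass difference, then unused fields."""
    return f"{x:10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3}{mass_diff:>2}  0  0  0  0"


def check_isotope_encoding(molfile):
    """Report unrecognized atom symbols and how isotopes are encoded."""
    lines = molfile.splitlines()
    n_atoms = int(lines[3][0:3])  # counts line: atom count in columns 1-3
    report = {"unrecognized": [], "mass_diff_atoms": [], "has_m_iso": False}
    for number, line in enumerate(lines[4:4 + n_atoms], start=1):
        symbol = line[31:34].strip()               # columns 32-34
        mass_diff = int(line[34:36].strip() or 0)  # columns 35-36
        if symbol not in ELEMENTS:
            report["unrecognized"].append((number, symbol))
        if mass_diff != 0:
            report["mass_diff_atoms"].append(number)
    report["has_m_iso"] = any(l.startswith("M  ISO") for l in lines)
    return report


HEADER = ["toy record", "  hypothetical", ""]
COUNTS = "  2  1  0  0  0  0  0  0  0  0999 V2000"

# Isotope written as a pseudo-element symbol: flagged as not recognized.
BAD = "\n".join(HEADER + [COUNTS, atom_line("C"), atom_line("D", x=1.0),
                          "  1  2  1  0", "M  END"])

# Same atom written as H, with the isotope declared in an "M  ISO" line.
FIXED = "\n".join(HEADER + [COUNTS, atom_line("C"), atom_line("H", x=1.0),
                            "  1  2  1  0", "M  ISO  1   2   2", "M  END"])

print(check_isotope_encoding(BAD))
print(check_isotope_encoding(FIXED))
```

The second record shows one way such a structure could be encoded so that every atom symbol is a real element and the isotope is carried by the “M ISO” property line.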
CVSP: standardization
• A standardization workflow was developed for the
Open PHACTS's registration system
• The workflow includes modules such as
• SMIRKS rules derived from FDA SRS manual
• Resetting symmetric stereo
• Dearomatize
• Layout
• Fix “fixable” stereo issues
• Disconnect all metals from N, O, F
• Fold non-stereo hydrogens
• Handle partial ionization of acid-base
• etc
Open PHACTS chemical registry system:
what do we use as chemical identity?
• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric, canonical)
Drawbacks
• SMILES: many flavors
• Standard InChI
• does not include unknown/undefined stereo unless at least one defined stereo is present
• does not distinguish between undefined and unknown stereo (always “?”)
• performs some basic tautomer canonicalization, which we wanted to prevent so that all tautomers can be distinguished (sometimes useful for linking spectral data to a specific tautomer)
• assumes absolute stereo or no stereo at all
Path we took:
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• Always include unknown/undefined stereo ('u', '?')
• Add Fixed H layer (to distinguish between tautomers)
• Use chiral flag in MOL/SD record (ON: absolute stereo, OFF: relative)
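As an illustration of how these options might be passed to the InChI command-line program, here is a small sketch that only builds the argument list. The executable name `inchi-1`, the file names, and the helper `inchi_command` are assumptions; the option prefix ("-" on Unix-like systems, "/" on Windows) follows the usual convention of the InChI executable and should be checked against the version in use.

```python
import sys

# Options named on the slide, written without a platform prefix.
INCHI_OPTIONS = ["SUU", "SLUUD", "FixedH", "SUCF"]


def inchi_command(mol_path, out_path, executable="inchi-1", platform=sys.platform):
    """Return an argument list for a non-standard InChI run.

    Only the command is assembled here; nothing is executed.
    """
    prefix = "/" if platform.startswith("win") else "-"
    return [executable, mol_path, out_path] + [prefix + opt for opt in INCHI_OPTIONS]


# Hypothetical file names, shown for a Linux-style command line.
print(" ".join(inchi_command("record.mol", "record.txt", platform="linux")))
```

The resulting list could be handed to `subprocess.run` once the executable is installed; with these options the output is a non-standard InChI, so its prefix is `InChI=1/` rather than `InChI=1S/`.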
For each compound (CSID), parent generation is attempted.
Reference: Sitzmann et al., “Tautomerism in large databases”, J. Comput. Aided Mol. Des. (2010)
Parent: Description
• Charge-Insensitive: An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear.
• Isotope-Insensitive: Isotopes are replaced by the common weight.
• Stereo-Insensitive: Stereo is stripped.
• Tautomer-Insensitive: Tautomer canonicalization attempts to generate a “reasonable” tautomer.
• Super-Insensitive: This parent combines all of the above.
There is no fragment-insensitive parent: we treat all fragments as equal entities.
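The idea of parents can be illustrated with a toy model. The sketch below is not how CVSP generates parents (that requires real chemistry transformations such as neutralization and tautomer canonicalization); it simply treats each record as a set of opaque labels and "forgets" one or more of them, so that records sharing a parent compare equal. The `Record` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, replace

ANY = "*"  # placeholder for a layer the parent is insensitive to


@dataclass(frozen=True)
class Record:
    skeleton: str   # stands in for the connectivity
    charge: str
    isotope: str
    stereo: str
    tautomer: str


def parent(record, *insensitive_to):
    """Return a copy of the record with the named layers wildcarded."""
    return replace(record, **{layer: ANY for layer in insensitive_to})


def super_parent(record):
    """The super parent is insensitive to all four layers at once."""
    return parent(record, "charge", "isotope", "stereo", "tautomer")


# Two hypothetical records that differ only in their stereo label.
a = Record("skeleton-1", "neutral", "natural", "S", "form-1")
b = Record("skeleton-1", "neutral", "natural", "R", "form-1")

print(parent(a, "stereo") == parent(b, "stereo"))   # same stereo parent
print(parent(a, "charge") == parent(b, "charge"))   # charge parents still differ
print(super_parent(a) == super_parent(b))           # same super parent
```

In this picture, two children that share a stereo-insensitive parent are exactly the records that become identical once the stereo layer is ignored.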
Deposited SDF record: CTAB, REGID1, DataSource, Synonym1, Synonym2, XRef1, etc.
Standardized entity: OPS_ID1
Parents:
• Charge Parent (OPS_ID7)
• Isotope Parent (OPS_ID5)
• Stereo Parent (OPS_ID4)
• Tautomer Parent (OPS_ID6)
• Super Parent (OPS_ID8)
Fragments:
• Fragment (OPS_ID3)
• Fragment (OPS_ID2)
Chemistry Validation and Standardization Platform (CVSP)
at cvsp.chemspider.com
• Validation
• Standardization
• Parent generation
RDF Export
Data
Data is being imported from ChemSpider to Open PHACTS in RDF/Turtle.
RDF/VoID
– VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data. http://www.w3.org/TR/void
• skos:exactMatch (SKOS: Simple Knowledge Organization System)
e.g. to link compounds in OPS with compounds in ChEBI.
• skos:closeMatch
e.g. to link Stereo Insensitive Parents to their Children within OPS.
• skos:relatedMatch
e.g. to link Parent compounds that contain others as Fragments.
– Recommendations on how to create the VoID have been specified by Manchester here: http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/
[Figure: OPS1 (DrugBank ID DB07241) and the related entities OPS2, OPS3, OPS4, OPS5 and OPS6]
ops:OPS1 skos:exactMatch <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB07241> .
ops:OPS2 skos:relatedMatch ops:OPS1 .
ops:OPS3 skos:relatedMatch ops:OPS1 .
ops:OPS3 skos:closeMatch ops:OPS4 .
ops:OPS3 skos:closeMatch ops:OPS5 .
ops:OPS4 skos:closeMatch ops:OPS6 .
ops:OPS5 skos:closeMatch ops:OPS6 .
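The triples above can be produced from a simple list of links. The following sketch uses plain string formatting rather than an RDF library; the `ops:` namespace IRI is a placeholder (the slide does not give it), while the SKOS namespace is the standard one. The helper names `to_turtle` and `close_matches` are hypothetical.

```python
PREFIXES = {
    "ops": "http://example.org/ops/",                 # placeholder, not the real OPS namespace
    "skos": "http://www.w3.org/2004/02/skos/core#",   # standard SKOS namespace
}

# (subject, predicate, object) links copied from the slide.
LINKS = [
    ("ops:OPS1", "skos:exactMatch",
     "<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB07241>"),
    ("ops:OPS2", "skos:relatedMatch", "ops:OPS1"),
    ("ops:OPS3", "skos:relatedMatch", "ops:OPS1"),
    ("ops:OPS3", "skos:closeMatch", "ops:OPS4"),
    ("ops:OPS3", "skos:closeMatch", "ops:OPS5"),
    ("ops:OPS4", "skos:closeMatch", "ops:OPS6"),
    ("ops:OPS5", "skos:closeMatch", "ops:OPS6"),
]


def to_turtle(links, prefixes):
    """Serialize the links as Turtle: prefix declarations, then one triple per line."""
    header = [f"@prefix {name}: <{iri}> ." for name, iri in prefixes.items()]
    body = [f"{s} {p} {o} ." for s, p, o in links]
    return "\n".join(header + [""] + body)


def close_matches(links, subject):
    """List the entities a subject points to with skos:closeMatch."""
    return sorted(o for s, p, o in links if s == subject and p == "skos:closeMatch")


print(to_turtle(LINKS, PREFIXES))
print(close_matches(LINKS, "ops:OPS3"))
```

Keeping the links as data makes it straightforward to follow a chain such as OPS3 to OPS4 to OPS6 before any SPARQL endpoint is available.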
Future work
Enabling full semantic web capabilities:
• Establishing RDF server with all relationships
(including parent-child relationships)
• Develop SPARQL capability for querying RDF
Validate all records in ChemSpider by passing them
through CVSP
Feedback
CVSP at
cvsp.chemspider.com

Acs 2013 indianapolis_cvsp

  • 1. Karen Karapetyan, Colin Batchelor, Jonathan Steele , David Sharpe Valery Tkachenko, Antony Williams Building support for the semantic web for chemistry at the Royal Society of Chemistry
  • 2.
  • 3. http://www.openphacts.org Open PHACTS is an Innovative Medicines Initiative (IMI) project, aiming to reduce the barriers to drug discovery in industry, academia and for small businesses. Semantic web is one of the corner stones
  • 4. RDF Export Data: ChEMBL HMDB DrugBank Chemistry Validation and Standardization Platform (CVSP) at cvsp.chemspider.com • Validation • Standardization • Parent generation • Run on Hadoop-based farm
  • 5. CVSP : chemical validation free chemistry validation platform that performs: • Structure validation • Atoms • Bonds • Valence • Stereo • If aromatic - check that uniquely dearomatized • Strongest acid not ionized first in partially-ionized system • Cross-matching of SDF fields • synonyms • InChIs • Smiles
  • 6. Input formats supported: CDX, MOL, SDF, ZIP, GZ, tab-delimited text files
  • 7. CVSP: standardization modules • Custom processing lets the user put together a workflow from a list of predefined standardization modules
  • 8.
  • 9. Data set examples: • ChemSpider (passed 100K records) • All records are planned to pass through CVSP • DrugBank (~6.5K records) • ChEMBL (~1.2 mln records)
  • 11. DrugBank dataset (6516 records) ~60 records that can't be dearomatized unambiguously DB04283 DB04462
  • 12. ~30 records with bonds that do not make sense DB04283 DDB04009
  • 13. 2 records where Smiles, InChI, and name did not match the structure DB00611 DB01547
  • 14. ~40 records where InChIs did not match the structure DrugBank ID: DB00755 InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+ DrugBank ID: DB00614
  • 15. DB08128 J. Brechner, IUPAC Graphical Representation of stereochem. configurations Section: ST-1.1.10 DB06287 7 records with 2 stereo bonds at chiral atoms
  • 16. CVSP validation of ChEMBL 16 (~1.3 mln. records) • Overall 0.7% of records had validation issues • Stereo problems (~82%) • Directions of bonds do not make sense (~63%) • Ambiguous stereo : 2 stereo bonds at chiral center (~19%)
  • 17. “Direction of bond makes no sense” – 63%
  • 18. “Stereo types of the opposite bonds mismatch” - 15% http://www.iupac.org/publications/pac/2006/pdf/7810x1897.pdf
  • 19. “Stereo types of non-opposite bonds match” – 2%
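The "2 stereo bonds at chiral center" check mentioned in the slides above can be sketched against a V2000 molfile bond block. This is only one plausible reading of that check, not CVSP's actual rule: it assumes a V2000 connection table, counts single bonds whose stereo field is 1 (wedge) or 6 (hash) by their first atom (the narrow end), and reports atoms that start two or more such bonds. The function name and the example molblock are hypothetical.

```python
from collections import Counter


def atoms_with_multiple_stereo_bonds(molblock: str) -> list[int]:
    """Return 1-based atom indices that begin two or more wedge/hash bonds.

    Assumes a V2000 molfile: three header lines, a counts line, then the
    atom block followed by the bond block (fixed-width, 3-character fields).
    """
    lines = molblock.splitlines()
    counts = lines[3]
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    bond_lines = lines[4 + n_atoms : 4 + n_atoms + n_bonds]

    stereo_starts = Counter()
    for line in bond_lines:
        begin_atom = int(line[0:3])
        bond_type = int(line[6:9])
        stereo = int(line[9:12].strip() or 0)
        # For single bonds, 1 = up (wedge) and 6 = down (hash); the narrow
        # end of the wedge sits on the first atom of the bond line.
        if bond_type == 1 and stereo in (1, 6):
            stereo_starts[begin_atom] += 1
    return sorted(atom for atom, n in stereo_starts.items() if n >= 2)


# Hypothetical five-atom example: atom 1 carries one wedge and one hash bond.
EXAMPLE = "\n".join([
    "example",
    "  hypothetical",
    "",
    "  5  4  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.0000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -1.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    1.0000    0.0000 Br  0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000   -1.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  1",
    "  1  3  1  6",
    "  1  4  1  0",
    "  1  5  1  0",
    "M  END",
])

print(atoms_with_multiple_stereo_bonds(EXAMPLE))
```

Whether two stereo bonds at one center are actually ambiguous depends on their geometry, so a real validator would need the 2D coordinates as well; this sketch only locates candidates.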
  • 20. “atom not recognized” – 3% isotopes Should be atom from periodic table No mass difference in atom line No “M ISO” in connection table In molfile:
  • 21. CVSP : standardization • Standardization workflow was developed for Open PHACTS's registration system • Workflow includes modules like • SMIRKS rules derived from FDA SRS manual • Resetting symmetric stereo • Dearomatize • Layout • Fix “fixable” stereo issues • Disconnect all metals from N, O, F • Fold non-stereo hydrogens • Handle partial ionization of acid-base • etc
  • 22. Open PHACTS chemical registry system: what do we use as chemical identity? • Standard InChI/InChIKey (currently used by ChemSpider) • Absolute SMILES (isomeric canonical) Drawbacks: • SMILES: many flavors • Standard InChI • does not include unknown/undefined stereo unless at least one defined stereo is present • does not distinguish between undefined and unknown stereo (always “?”) • does some basic tautomer canonicalization, which we wanted to prevent in order to distinguish between all tautomers (sometimes useful for linking spectral data to a specific tautomer) • assumes absolute stereo or no stereo at all Path we took: non-standard InChI with options SUU SLUUD FixedH SUCF • Always include unknown/undefined stereo ('u', '?') • Add Fixed H layer (to distinguish between tautomers) • Use chiral flag in MOL/SD record (ON: absolute stereo, OFF: relative)
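To make the standard versus non-standard distinction concrete, here is a minimal sketch that splits an InChI string into its layers. It relies only on the public InChI string layout: a standard InChI has the version prefix `1S`, and a fixed-H layer, when present, starts with `/f`. The function name is hypothetical, and the example string is the DB00755 InChI quoted on slide 14; the sketch keeps only the first occurrence of each layer letter, so sublayers repeated after `/f` are ignored.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula and single-letter layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    version = parts[0]
    result = {
        "version": version,
        "standard": version.endswith("S"),  # "1S" marks a standard InChI
        "formula": None,
        "layers": {},
    }
    rest = parts[1:]
    # The formula, when present, is the only segment that does not start
    # with a lowercase layer letter.
    if rest and rest[0] and not rest[0][:1].islower():
        result["formula"] = rest[0]
        rest = rest[1:]
    for part in rest:
        if part:
            # Keep the first occurrence of each layer letter only.
            result["layers"].setdefault(part[0], part[1:])
    return result


EXAMPLE_INCHI = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-"
    "20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)"
    "/b9-6+,12-11+,15-8+,16-14+"
)

info = split_inchi_layers(EXAMPLE_INCHI)
print(info["standard"], info["formula"], sorted(info["layers"]))
```

An identifier generated with the FixedH option would be expected to show a non-`S` version prefix and, where mobile hydrogens exist, an `f` entry in the layer dictionary.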
  • 23. For each compound (CSID), parent generation is attempted (“Tautomerism in large databases”, Sitzmann and others, J. Comput. Aided Mol. Des. (2010)). Parent: description • Charge-insensitive: an attempt is made to neutralize ionized acids and bases; envisioned as an ongoing improvement as new cases appear • Isotope-insensitive: isotopes are replaced by the common weight • Stereo-insensitive: stereo is stripped • Tautomer-insensitive: tautomer canonicalization attempts to generate a “reasonable” tautomer • Super-insensitive: this parent is all of the above • No fragment-insensitive parent: we treat all fragments as equal entities
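As a toy illustration of two of the parent types above, the sketch below strips isotope labels and stereo markers from a SMILES string at the text level. This is not how CVSP generates parents (the slides do not say), and a real implementation would work on the molecular graph with a cheminformatics toolkit and re-canonicalize afterwards; the function names are hypothetical, and charge and tautomer parents are out of scope here.

```python
import re


def isotope_insensitive(smiles: str) -> str:
    """Drop isotope mass numbers, e.g. [13CH4] -> [CH4], [2H] -> [H]."""
    return re.sub(r"\[\d+", "[", smiles)


def stereo_insensitive(smiles: str) -> str:
    """Drop tetrahedral (@, @@) and double-bond (/, \\) stereo markers."""
    without_chirality = re.sub(r"@+", "", smiles)
    return without_chirality.replace("/", "").replace("\\", "")


def isotope_and_stereo_insensitive(smiles: str) -> str:
    """Apply both strippings, a partial stand-in for the 'super' parent."""
    return stereo_insensitive(isotope_insensitive(smiles))


print(isotope_insensitive("[2H]O[2H]"))
print(stereo_insensitive("N[C@@H](C)C(=O)O"))
print(isotope_and_stereo_insensitive("[13CH3][C@H](N)C(=O)O"))
```

The outputs are valid but not canonical SMILES, so two children of the same parent may still need canonicalization before they compare equal as strings.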
  • 24. CTAB REGID1 DataSource Synonym1 Synonym2 XRef1 etc Deposited SDF record Standardized entity OPS_ID1 Super Parent (OPS_ID8) Parents Charge Parent (OPS_ID7) Isotope Parent (OPS_ID5) Stereo Parent (OPS_ID4) Tautomer Parent (OPS_ID6) Fragment (OPS_ID3) Fragment (OPS_ID2)
  • 25. Chemistry Validation and Standardization Platform (CVSP) at cvsp.chemspider.com • Validation • Standardization • Parent generation RDF Export Data
  • 26. Data is being imported from ChemSpider to Open PHACTS in RDF/turtle
  • 27. RDF/VoID – VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data. http://www.w3.org/TR/void • skos:exactMatch (Simple Knowledge Organisation System) E.g. To link compounds in OPS with compounds in ChEBI. • skos:closeMatch E.g. To link Stereo Insensitive Parents to their Children within OPS. • skos:relatedMatch E.g. To link Parent compounds that contain others as Fragments. – Recommendations on how to create the VoID have been specified by Manchester here: http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/
  • 28. OPS1 DrugBank ID DB07241 OPS5 OPS4 OPS3 OPS2 OPS6 ops:OPS1 skos:exactMatch <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB07241> . ops:OPS2 skos:relatedMatch ops:OPS1 . ops:OPS3 skos:relatedMatch ops:OPS1 . ops:OPS3 skos:closeMatch ops:OPS4 . ops:OPS3 skos:closeMatch ops:OPS5 . ops:OPS4 skos:closeMatch ops:OPS6 . ops:OPS5 skos:closeMatch ops:OPS6 .
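The triples on this slide can be produced from a plain list of links. The following is a minimal sketch that serializes such links as Turtle using only the standard library. The `skos:` namespace is the W3C one; the `ops:` namespace URI is a placeholder assumption, since the slides do not give it, and the function name is hypothetical.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
OPS_NS = "http://example.org/ops/"  # placeholder; the real namespace is not given in the slides

ALLOWED_PREDICATES = {"exactMatch", "closeMatch", "relatedMatch"}


def links_to_turtle(links: list[tuple[str, str, str]]) -> str:
    """Serialize (subject, predicate, object) links as Turtle.

    Subjects are OPS identifiers; objects are either OPS identifiers or
    full http(s) URIs pointing at an external dataset.
    """
    lines = [
        f"@prefix skos: <{SKOS_NS}> .",
        f"@prefix ops: <{OPS_NS}> .",
        "",
    ]
    for subject, predicate, obj in links:
        if predicate not in ALLOWED_PREDICATES:
            raise ValueError(f"unsupported predicate: {predicate}")
        is_uri = obj.startswith(("http://", "https://"))
        obj_term = f"<{obj}>" if is_uri else f"ops:{obj}"
        lines.append(f"ops:{subject} skos:{predicate} {obj_term} .")
    return "\n".join(lines)


LINKS = [
    ("OPS1", "exactMatch",
     "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB07241"),
    ("OPS2", "relatedMatch", "OPS1"),
    ("OPS3", "relatedMatch", "OPS1"),
    ("OPS3", "closeMatch", "OPS4"),
    ("OPS3", "closeMatch", "OPS5"),
    ("OPS4", "closeMatch", "OPS6"),
    ("OPS5", "closeMatch", "OPS6"),
]

print(links_to_turtle(LINKS))
```

Restricting predicates to the three SKOS mapping properties named on slide 27 keeps a typo from silently producing an unknown relation.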
  • 29.
  • 30. Future work Enabling full semantic web capabilities: • Establishing an RDF server with all relationships (including parent-child relationships) • Developing SPARQL capability for querying RDF • Validating all records in ChemSpider by passing them through CVSP