Standardization and Generation of Parents for Open PHACTS Chemical Registry System
Karen Karapetyan, Valery Tkachenko
Colin Batchelor, Antony Williams
Validation checks
• Correct file format (SDF, MOL, CDX, etc.)
• “Valid” chemical structure
• Valid atoms (not query atoms)
• Valid bonds
• Valid valences
• Valid charges
• SP3 stereo
• Synonyms
• Names (name to structure)
• SMILES, InChIs (SMILES/InChI to structure)
• XRefs
Severity assigned to every validation issue
Filtering by severity and by issues
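The two slides above (a severity on every validation issue, then filtering by severity and by issue) can be sketched roughly as follows. This is a minimal illustration, not the CVSP implementation: the severity levels, the issue codes and the names `Severity`, `Issue` and `filter_issues` are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; the slides do not name the actual ones.
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    record_id: str
    code: str          # hypothetical issue code, e.g. "query_atom"
    severity: Severity


def filter_issues(issues, min_severity=Severity.INFO, codes=None):
    """Keep issues at or above min_severity, optionally limited to some codes."""
    return [
        issue for issue in issues
        if issue.severity >= min_severity
        and (codes is None or issue.code in codes)
    ]


issues = [
    Issue("rec1", "query_atom", Severity.ERROR),
    Issue("rec2", "undefined_stereo", Severity.WARNING),
    Issue("rec3", "synonym_mismatch", Severity.INFO),
]
errors_and_warnings = filter_issues(issues, min_severity=Severity.WARNING)
stereo_only = filter_issues(issues, codes={"undefined_stereo"})
```

Using an ordered enum keeps "this severity or worse" a plain comparison, which is the usual way such a filter is exposed in a results table.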
Standardization – Organometallics/Salts
• Always disconnect N, O, and F from metals
• Disconnect nonmetals (except N, O, F) from transition metals (except Hg)
• Ionize free metal with carboxylic acid (metals of Groups I and II)
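The first two disconnection rules above can be read as a small decision function on a metal–nonmetal bond. The sketch below is only an illustration of that reading: the element sets are deliberately incomplete subsets chosen for the example, and `should_disconnect` is a hypothetical name, not part of any existing toolkit.

```python
# Illustrative subsets only; a real implementation would cover every element.
# Hg is listed with the transition metals because the slide treats it as one.
TRANSITION_METALS = {"Fe", "Co", "Ni", "Cu", "Pd", "Pt", "Hg"}
OTHER_METALS = {"Li", "Na", "K", "Mg", "Ca", "Al", "Sn", "Pb"}
METALS = TRANSITION_METALS | OTHER_METALS


def should_disconnect(metal: str, nonmetal: str) -> bool:
    """Decide whether a covalent metal-nonmetal bond should be broken."""
    if metal not in METALS or nonmetal in METALS:
        raise ValueError("expected a (metal, nonmetal) pair")
    # Rule 1: N, O and F are always disconnected from metals.
    if nonmetal in {"N", "O", "F"}:
        return True
    # Rule 2: other nonmetals are disconnected from transition metals, except Hg.
    return metal in TRANSITION_METALS and metal != "Hg"
```

Under these assumptions an Fe–Cl bond is disconnected, while Hg–C and Sn–C bonds are left intact.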
Standardization SMIRKS
(based on InChI normalization and on FDA SRS)
Examples of InChI normalization
• [*;H+:1]>>[*;H:1]
• [O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
• [N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
Examples of FDA SRS rules
• [n:1]=[O:2]>>[n+:1][O-:2]
• [*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
• [N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
• Thiopurine:
[H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
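When maintaining a list of SMIRKS rules like the ones above, one cheap sanity check is to compare the atom-map numbers on both sides of each rule. The sketch below does this with a regular expression only; it does not parse or apply SMIRKS, and the helper names `atom_maps` and `dropped_and_added` are hypothetical.

```python
import re

# Matches the ":<n>]" that closes a mapped atom such as "[N+:2]" or "[H,*:13]".
ATOM_MAP = re.compile(r":(\d+)\]")


def atom_maps(smirks: str):
    """Return (reactant_maps, product_maps) as sets of integers."""
    reactants, products = smirks.split(">>")

    def maps(side):
        return {int(n) for n in ATOM_MAP.findall(side)}

    return maps(reactants), maps(products)


def dropped_and_added(smirks: str):
    """List map numbers that occur on only one side of the rule."""
    reactant_maps, product_maps = atom_maps(smirks)
    return sorted(reactant_maps - product_maps), sorted(product_maps - reactant_maps)


diazo = "[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]"
salt = "[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]"
```

For the diazo rule every mapped atom survives. For the amine/carboxylic acid rule, map 6 (the explicit acid hydrogen) appears only on the reactant side, which matches the intent of moving that proton onto the nitrogen (H3 becomes H4).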
Standardization
• Dearomatize
• Double bond with adjacent wiggly single bond
• Fold hydrogen atoms with no up or down bonds
Standardization
• Remove symmetric stereocenters
• Turn off chiral flag if no up or down bonds
• Do Layout
Chiral flag is set
Standardization – partially ionized acids
(move the proton from the stronger acid to the weaker one)
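One way to picture the rule above is as a redistribution of a fixed number of protons over the acidic sites of a record, with the weakest acids keeping their protons first. The sketch below assumes each site carries an approximate pKa; the values, the `AcidSite` type and `redistribute_protons` are illustrative assumptions, not how CVSP ranks acid strength.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AcidSite:
    name: str          # must be unique within one record in this sketch
    pka: float         # rough illustrative value, an assumption
    protonated: bool


def redistribute_protons(sites):
    """Keep the proton count fixed but place protons on the weakest acids."""
    n_protons = sum(site.protonated for site in sites)
    # Highest pKa = weakest acid = first to be protonated.
    weakest_first = sorted(sites, key=lambda site: site.pka, reverse=True)
    keep = {site.name for site in weakest_first[:n_protons]}
    return [replace(site, protonated=site.name in keep) for site in sites]


before = [
    AcidSite("sulfonic", -2.0, True),     # strong acid drawn protonated
    AcidSite("carboxylic", 4.8, False),   # weaker acid drawn ionized
]
after = redistribute_protons(before)
```

Here the record ends up with an ionized sulfonate and a protonated carboxylic acid; the overall charge is unchanged.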
For each compound, parent generation is attempted
“Tautomerism in large databases”, Sitzmann et al., J. Comput. Aided Mol. Des. (2010)

Parent | Description | RDF
Charge-Insensitive | An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear. | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000460
Isotope-Insensitive | Isotopes replaced by common weight | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000459
Stereo-Insensitive | SP3 and double bond stereo removed | void:linkPredicate skos:closeMatch cheminf:CHEMINF_000456
Tautomer-Insensitive | Tautomer canonicalization attempts to generate a canonical tautomer | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000486
Super Parent | Generated by applying all of the above modifications | void:linkPredicate skos:broadMatch; dul:expresses cheminf:CHEMINF_000458
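The RDF annotations listed for the parents can be held in a simple lookup so that code emitting links between a compound and its parents picks the right predicate. This is a sketch under assumptions: the dictionary keys and the function `link_properties` are hypothetical, and the `dul:expresses` property for the stereo parent is assumed by analogy with the other rows, since the slide shows only the CHEMINF term there.

```python
# (link predicate, CHEMINF term) per parent type, as given on the slide.
PARENT_LINKS = {
    "charge":   ("skos:closeMatch", "cheminf:CHEMINF_000460"),
    "isotope":  ("skos:closeMatch", "cheminf:CHEMINF_000459"),
    # The slide omits "dul:expresses" for this row; assumed here.
    "stereo":   ("skos:closeMatch", "cheminf:CHEMINF_000456"),
    "tautomer": ("skos:closeMatch", "cheminf:CHEMINF_000486"),
    "super":    ("skos:broadMatch", "cheminf:CHEMINF_000458"),
}


def link_properties(parent_type: str) -> dict:
    """Return the property/value pairs describing the link to a parent."""
    predicate, term = PARENT_LINKS[parent_type]
    return {"void:linkPredicate": predicate, "dul:expresses": term}
```

Only the super parent uses `skos:broadMatch`; the four single-aspect parents are all close matches.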
[Diagram labels: Fragment; Deposited Substances; Compounds; Parents]
Deposited Substances
• SID 1: SDF1; DataSource1; Synonym1, Synonym2; XRef1
• SID 2: SDF2; DataSource2; Synonym1, Synonym3; XRef2
Compounds
• OPS_ID 1: standardized molecule; DataSource1, DataSource2; Synonym1, Synonym2, Synonym3; XRef1, XRef2
• OPS_ID 2: standardized MOL; DataSource3, DataSource4; Synonym4, Synonym5, Synonym6; XRef3, XRef4
Parents
• Charge Parent (OPS_ID 6)
• Isotope Parent (OPS_ID 4)
• Stereo Parent (OPS_ID 3)
• Tautomer Parent (OPS_ID 5)
• Super Parent (OPS_ID 7)
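In the diagram, two deposited substances (SID 1 and SID 2) end up under one compound (OPS_ID 1) that carries the union of their data sources, synonyms and cross-references. A minimal sketch of that merge is shown below; the `structure_key` field stands in for whatever identity key the standardized structure receives, and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Substance:
    sid: str
    structure_key: str   # hypothetical key of the standardized structure
    data_source: str
    synonyms: list
    xrefs: list


@dataclass
class Compound:
    structure_key: str
    sids: list = field(default_factory=list)
    data_sources: set = field(default_factory=set)
    synonyms: set = field(default_factory=set)
    xrefs: set = field(default_factory=set)


def merge_substances(substances):
    """Group substances by structure key and pool their annotations."""
    compounds = {}
    for s in substances:
        c = compounds.setdefault(s.structure_key, Compound(s.structure_key))
        c.sids.append(s.sid)
        c.data_sources.add(s.data_source)
        c.synonyms.update(s.synonyms)
        c.xrefs.update(s.xrefs)
    return compounds


deposited = [
    Substance("SID 1", "KEY-A", "DataSource1", ["Synonym1", "Synonym2"], ["XRef1"]),
    Substance("SID 2", "KEY-A", "DataSource2", ["Synonym1", "Synonym3"], ["XRef2"]),
]
compound = merge_substances(deposited)["KEY-A"]
```

The shared Synonym1 is stored once, so the merged compound lists Synonym1, Synonym2 and Synonym3, as in the diagram.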
What do we use as chemical identity of the standardized records
(primary compound key)?
• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric canonical)
Drawbacks
• SMILES – can be too long; no accepted standard; needs to be hashed
• Standard InChI
    • does not distinguish between undefined and unknown stereo
    • by default does some basic tautomer canonicalization (not needed in new model)
    • by default assumes absolute stereo
Proposed Solution
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• much more sensitive to stereo description
• Fixes mobile hydrogens (so tautomers could be distinguished)
• Handles “AND-ed” relative stereo
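The SMILES drawback listed above (strings can be too long for a key and need to be hashed) can be illustrated with a fixed-length digest. This is a sketch of the general idea only: SHA-256 and the name `smiles_key` are choices made for the example, and the slides do not say which hash, if any, was considered.

```python
import hashlib


def smiles_key(smiles: str) -> str:
    """Map a SMILES string of any length to a fixed-length hex key."""
    return hashlib.sha256(smiles.encode("utf-8")).hexdigest()


short_key = smiles_key("CCO")
long_key = smiles_key("C" * 5000)   # an artificially long chain
```

Both keys are 64 hex characters regardless of input length. A key like this only identifies a structure if every SMILES comes from the same canonicalization algorithm, which is the "no accepted standard" drawback from the same slide.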
Thanks
We would appreciate any comments.
For comments or questions email
karapetyank@rsc.org

Editor's notes

  1. I would like to start by defining what a “quality” record means, because that is what the validation part of CVSP is about. A chemical record has several aspects to its quality. The easiest one to check is file format correctness: each file format has its own formatting rules that a record in that format needs to follow. This type of file validation is done by all database maintainers that have deposition systems. Another, more relevant, type of validation is chemical validation. A record can be perfectly formatted from a file format point of view but make no chemical sense. Structure validation is usually overlooked or not prioritized highly. Chemical validations include atom validation (checking that each atom is a legal chemical atom, with legal charges and valences) and checking that stereo is defined. Synonym validation is very useful for spotting inconsistent records that are worth pointing out to the depositor. Often during data export/import the synonyms and/or the structure are manipulated, and the relationships between them can become faulty, so attempting to verify that synonym and structure actually match is worth doing. SMILES/InChIs: again, the relationship between the chemical record and the depositor-provided InChI or SMILES can be faulty. As I'll show later, this inconsistency can reveal a systematic issue with a data set, as sometimes the InChI or SMILES does not match the structure.
  2. The result of processing is a list of records with validation messages in the middle. If a record was standardized, a “Standardized” column is present showing the structure.
  3. Here is the larger DrugBank dataset we have processed. Some warnings are shown in the dropdown list: warnings about metals, stereo, enol presence, etc.