Standardization and Generation of Parents for Open PHACTS Chemical Registry System
1. Standardization and Generation of Parents for Open PHACTS Chemical Registry System
Karen Karapetyan, Valery Tkachenko
Colin Batchelor, Antony Williams
5. Standardization – Organometallics/Salts
Always disconnect N, O, and F from metals:
Disconnect nonmetals (other than N, O, F) from transition metals (except Hg)
Ionize a free metal with a carboxylic acid (Group I and II metals)
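The two disconnection rules above can be read as a simple decision on a metal-nonmetal bond. The following is a minimal sketch of that reading, not the registry's implementation: the element sets, the choice to count group 12 as transition metals, and the function name `should_disconnect` are assumptions made for illustration. The third rule (ionizing a free metal with a carboxylic acid) is not covered.

```python
# Illustrative element sets. Which elements count as "metal" or
# "transition metal" is an assumption of this sketch, not from the slides.
TRANSITION_METALS = {
    "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd",
    "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
}
GROUP_1_2_METALS = {"Li", "Na", "K", "Rb", "Cs", "Be", "Mg", "Ca", "Sr", "Ba"}
OTHER_METALS = {"Al", "Ga", "In", "Tl", "Sn", "Pb", "Bi"}
METALS = TRANSITION_METALS | GROUP_1_2_METALS | OTHER_METALS


def should_disconnect(metal: str, nonmetal: str) -> bool:
    """Decide whether a metal-nonmetal bond is disconnected (hypothetical helper)."""
    if metal not in METALS:
        raise ValueError(f"{metal!r} is not in the illustrative metal set")
    # Rule 1: N, O and F are always disconnected from metals.
    if nonmetal in {"N", "O", "F"}:
        return True
    # Rule 2: other nonmetals are disconnected only from transition metals,
    # with Hg as the stated exception.
    return metal in TRANSITION_METALS and metal != "Hg"


print(should_disconnect("Na", "O"))   # True: O is always disconnected
print(should_disconnect("Fe", "Cl"))  # True: transition metal
print(should_disconnect("Hg", "Cl"))  # False: Hg is excepted
```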
6. Standardization SMIRKS
(based on InChI normalization and on FDA SRS)
Examples of InChI normalization
[*;H+:1]>>[*;H:1]
[O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
[N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
Examples of FDA SRS rules
[n:1]=[O:2]>>[n+:1][O-:2]
[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
Thiopurine
[H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
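Each SMIRKS rule above pairs atoms on the two sides of `>>` through atom map numbers such as `:1`. A quick way to see which mapped atoms a rule removes is to compare the map numbers on each side. This is a minimal text-level sketch using only a regular expression; it does not parse or apply SMIRKS, and the helper name `mapped_atoms` is hypothetical.

```python
from __future__ import annotations

import re

# Matches an atom map number at the end of a bracket atom, e.g. ":13]".
MAP_NUMBER = re.compile(r":(\d+)\]")


def mapped_atoms(smirks: str) -> tuple[set[int], set[int]]:
    """Return the atom map numbers on the reactant and product sides."""
    reactants, products = smirks.split(">>")
    left = {int(n) for n in MAP_NUMBER.findall(reactants)}
    right = {int(n) for n in MAP_NUMBER.findall(products)}
    return left, right


# FDA SRS style rule: amine plus carboxylic acid becomes ammonium carboxylate.
rule = "[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]"
left, right = mapped_atoms(rule)
print(sorted(left - right))  # [6]: the explicit acid hydrogen is dropped
```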
10. For each compound, parent generation is attempted
“Tautomerism in large databases”, Sitzmann and others, J. Comput. Aided Mol. Des. (2010)
Parent | Description | RDF
Charge-Insensitive | An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear. | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000460
Isotope-Insensitive | Isotopes are replaced by the common weight. | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000459
Stereo-Insensitive | SP3 and double-bond stereo are removed. | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000456
Tautomer-Insensitive | Tautomer canonicalization attempts to generate a canonical tautomer. | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000486
Super Parent | The super parent is generated by applying all of the above modifications. | void:linkPredicate skos:broadMatch; dul:expresses cheminf:CHEMINF_000458
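The RDF column above amounts to a lookup from parent type to a link predicate and a CHEMINF class. A minimal sketch of that lookup follows; the short keys, the dictionary name `PARENT_LINKS`, and the helper `linkset_properties` are hypothetical, and the output is only a predicate-object fragment (no subject or prefix declarations), not a complete RDF document.

```python
# Parent type -> (link predicate, CHEMINF class), copied from the table.
# The slide lists the stereo identifier without "dul:expresses"; this sketch
# assumes it follows the same pattern as the other rows.
PARENT_LINKS = {
    "charge": ("skos:closeMatch", "cheminf:CHEMINF_000460"),
    "isotope": ("skos:closeMatch", "cheminf:CHEMINF_000459"),
    "stereo": ("skos:closeMatch", "cheminf:CHEMINF_000456"),
    "tautomer": ("skos:closeMatch", "cheminf:CHEMINF_000486"),
    "super": ("skos:broadMatch", "cheminf:CHEMINF_000458"),
}


def linkset_properties(parent_type: str) -> str:
    """Format the predicate-object pairs for one parent type."""
    predicate, expresses = PARENT_LINKS[parent_type]
    return (
        f"void:linkPredicate {predicate} ;\n"
        f"dul:expresses {expresses}"
    )


print(linkset_properties("super"))
```

Only the super parent uses `skos:broadMatch`; every single-aspect parent is a `skos:closeMatch`.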
13. What do we use as chemical identity of the standardized records
(primary compound key)?
• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric canonical)
Drawbacks
• SMILES – can be too long; no accepted standard; needs to be hashed
• Standard InChI
    • does not distinguish between undefined and unknown stereo
    • by default does some basic tautomer canonicalization (not needed in new model)
    • by default assumes absolute stereo
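The SMILES drawback above ("can be too long ... needs to be hashed") can be illustrated with a fixed-length digest. This is a minimal sketch: the slides do not name a hash function, so SHA-256 is an assumption, and the function name `smiles_key` is hypothetical. The input is assumed to be canonical already; hashing does nothing to canonicalize it.

```python
import hashlib


def smiles_key(canonical_smiles: str) -> str:
    """Return a fixed-length key for an already-canonical SMILES string."""
    # SHA-256 is an illustrative choice, not taken from the slides.
    return hashlib.sha256(canonical_smiles.encode("utf-8")).hexdigest()


l_alanine = smiles_key("C[C@H](N)C(=O)O")
d_alanine = smiles_key("C[C@@H](N)C(=O)O")
print(len(l_alanine))          # 64 hex characters regardless of SMILES length
print(l_alanine != d_alanine)  # True: the stereo difference changes the key
```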
Proposed Solution
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• much more sensitive to stereo description
• Fixes mobile hydrogens (so tautomers could be distinguished)
• Handles “AND-ed” relative stereo
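The proposed option set can be kept in one place and formatted for whichever InChI front end is used. The sketch below only builds the option string; it does not call InChI. The helper name and the choice of prefix are assumptions (the InChI command-line tool takes `-` on Linux and `/` on Windows), and the comments paraphrase the InChI documentation.

```python
# Options named on the slide.
NONSTANDARD_OPTIONS = [
    "SUU",     # always include omitted unknown/undefined stereo
    "SLUUD",   # label unknown and undefined stereo differently
    "FixedH",  # add the fixed-hydrogen layer, so tautomers can be told apart
    "SUCF",    # use the chiral flag to choose absolute vs. relative stereo
]


def inchi_option_string(options: list, prefix: str = "-") -> str:
    """Join InChI options into a single string (hypothetical helper)."""
    return " ".join(prefix + option for option in options)


print(inchi_option_string(NONSTANDARD_OPTIONS))
# -SUU -SLUUD -FixedH -SUCF
```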
I would like to start by defining what a “quality” record means, because that is what the validation part of the CVSP is about. A chemical record has several aspects to its quality. The easiest one to check is file format correctness. Each file format has its own formatting rules that a record in that format needs to follow. This type of file validation is done by all database maintainers that have deposition systems.

Another, more relevant, type of validation is chemical validation. A record can be perfectly formatted from the file format point of view but make no chemical sense. Structure validation is usually overlooked or not prioritized highly. Chemical validations include atom validation, which checks that each atom is a legal chemical element with sensible charges and valences, and checking that stereo is defined.

Synonym validation is very useful for spotting inconsistent records that are worth pointing out to the depositor. Often during data export/import, synonyms and/or structures are manipulated and the relationships between them can become faulty, so attempting to verify that synonym and structure actually match is worth doing.

SMILES/InChIs: again, the relationship between the chemical record and the depositor-provided InChI or SMILES can be faulty. As I’ll show later, this inconsistency can reveal a systematic issue with a data set, as sometimes the InChI or SMILES does not match the structure.
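The chemical checks described above can be sketched as a function that collects messages for one record. This is an illustration only, not how CVSP is implemented: the record layout, the field names, the element subset, and the function name `validate_record` are all hypothetical, and the "computed" InChI is assumed to come from a toolkit run on the structure.

```python
from __future__ import annotations

# Illustrative subset of legal element symbols; a real check would use
# the full periodic table.
KNOWN_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Na", "K", "Fe"}


def validate_record(record: dict) -> list[str]:
    """Return validation messages for one record (hypothetical layout)."""
    messages = []
    # Atom validation: every symbol must be a legal element.
    for index, symbol in enumerate(record.get("atoms", [])):
        if symbol not in KNOWN_ELEMENTS:
            messages.append(f"atom {index}: unknown element symbol {symbol!r}")
    # Stereo validation: flag centers left undefined.
    undefined = record.get("undefined_stereo_centers", 0)
    if undefined:
        messages.append(f"{undefined} stereo center(s) undefined")
    # Identifier validation: deposited InChI should match the structure.
    deposited = record.get("deposited_inchi")
    computed = record.get("computed_inchi")
    if deposited and computed and deposited != computed:
        messages.append("deposited InChI does not match the structure")
    return messages


record = {
    "atoms": ["C", "Xx", "O"],
    "undefined_stereo_centers": 1,
    "deposited_inchi": "InChI=1S/CH4/h1H4",
    "computed_inchi": "InChI=1S/C2H6/c1-2/h1-2H3",
}
for message in validate_record(record):
    print(message)
```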
The result of processing is a list of records with validation messages in the middle. If a record was standardized, a “Standardized” column is present showing the standardized structure.
Here is the larger DrugBank dataset we have processed. Some warnings are shown in the dropdown list: warnings about metals, stereo, enol presence, etc.