The document summarizes the Chemical Validation and Standardization Platform (CVSP) used by Open PHACTS to validate and standardize chemical structure data from various sources. CVSP performs validation of chemical structures, generates standardized representations, and establishes parent-child relationships between structures. It has validated over 1.3 million records from ChEMBL and over 6,500 from DrugBank, identifying various issues. Standardized structures and relationships are exported in RDF/turtle format to integrate with the Open PHACTS semantic web platform.
Acs 2013 indianapolis_cvsp
1. Karen Karapetyan, Colin Batchelor,
Jonathan Steele, David Sharpe,
Valery Tkachenko, Antony Williams
Building support for the semantic web
for chemistry
at the Royal Society of Chemistry
2.
3. http://www.openphacts.org
Open PHACTS is an Innovative Medicines
Initiative (IMI) project, aiming to reduce the
barriers to drug discovery in industry, academia
and for small businesses.
Semantic web is one of the cornerstones
9. • ChemSpider (passed 100K records)
• All records are planned to pass through CVSP
• DrugBank (~6.5K records)
• ChEMBL (~1.2 million records)
Data set examples
13. 2 records where SMILES, InChI, and name did not match
the structure
DB00611 DB01547
14. ~40 records where InChIs did not match the structure
DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
DrugBank ID: DB00614
15. DB08128
J. Brecher, IUPAC
Graphical Representation of
Stereochemical Configuration
Section: ST-1.1.10
DB06287
7 records with 2 stereo bonds at chiral
atoms
16. CVSP validation of ChEMBL 16 (~1.3 million records)
• Overall 0.7% of records had validation issues
• Stereo problems (~82%)
• Directions of bonds do not make sense (~63%)
• Ambiguous stereo: 2 stereo bonds at chiral center (~19%)
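The percentages above can be turned into rough record counts. The sketch below is only a back-of-envelope estimate: it assumes that the 82%, 63% and 19% figures are all fractions of the records with validation issues (suggested by 63% + 19% = 82%, but not stated on the slide), and it uses the rounded totals quoted here, so the results are approximate.

```python
# Approximate figures read off the slide; the denominators are an assumption.
TOTAL_RECORDS = 1_300_000   # "~1.3 million" ChEMBL 16 records
ISSUE_RATE = 0.007          # 0.7% of records had validation issues

# Assumed to be fractions of the records with issues (63% + 19% = 82%).
SHARE_OF_ISSUES = {
    "stereo problems": 0.82,
    "bond directions do not make sense": 0.63,
    "two stereo bonds at chiral center": 0.19,
}


def estimate_counts(total=TOTAL_RECORDS, issue_rate=ISSUE_RATE):
    """Convert the quoted percentages into approximate record counts."""
    issues = round(total * issue_rate)
    counts = {"records with issues": issues}
    for label, share in SHARE_OF_ISSUES.items():
        counts[label] = round(issues * share)
    return counts


if __name__ == "__main__":
    for label, n in estimate_counts().items():
        print(f"{label}: ~{n:,}")
```

Under these assumptions, roughly nine thousand ChEMBL 16 records carry a validation issue, most of them stereo-related.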
20. “atom not recognized” – 3% isotopes
Should be atom from periodic table
No mass difference in atom line
No “M ISO” in connection table
In molfile:
21. CVSP : standardization
• Standardization workflow was developed for
Open PHACTS's registration system
• Workflow includes modules like
• SMIRKS rules derived from FDA SRS manual
• Resetting symmetric stereo
• Dearomatize
• Layout
• Fix “fixable” stereo issues
• Disconnect all metals from N, O, F
• Fold non-stereo hydrogens
• Handle partial ionization of acid-base
• etc
22. Open PHACTS chemical registry system:
what do we use as chemical identity?
• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric canonical)
Drawbacks
• SMILES: many flavors
• Standard InChI
• does not include unknown/undefined stereo unless at least one defined stereo is present
• does not distinguish between undefined and unknown stereo (always “?”)
• standard InChI does some basic tautomer canonicalization, which we wanted to prevent
so that all tautomers can be distinguished (sometimes useful for linking spectral data to
a specific tautomer)
• assumes absolute stereo or no stereo at all
Path we took:
Non-standard InChI with options: SUU SLUUD FixedH SUCF
• Always include unknown/undefined stereo ('u', '?')
• Add Fixed H layer (to distinguish between tautomers)
• Use chiral flag in MOL/SD record (ON: absolute stereo, OFF: relative stereo)
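As a rough illustration of how the four options above might be passed to an InChI command-line tool, the sketch below builds an argument list. It assumes the convention of the InChI reference executable, where options are prefixed with "-" on Linux and "/" on Windows; the executable name and file names are placeholders, and the function names are hypothetical.

```python
import sys

# Options named on the slide:
#   SUU    - always include unknown/undefined stereo
#   SLUUD  - use different labels for unknown and undefined stereo
#   FixedH - add the fixed-H layer (distinguishes tautomers)
#   SUCF   - use the chiral flag from the MOL/SD record
NON_STANDARD_OPTIONS = ("SUU", "SLUUD", "FixedH", "SUCF")


def inchi_option_args(options=NON_STANDARD_OPTIONS, windows=None):
    """Return the options with the platform's prefix ('/' on Windows, else '-')."""
    if windows is None:
        windows = sys.platform.startswith("win")
    prefix = "/" if windows else "-"
    return [prefix + option for option in options]


def inchi_command(input_file, output_file, executable="inchi-1", windows=None):
    """Assemble a command line; 'inchi-1' and the file names are placeholders."""
    return [executable, input_file, output_file] + inchi_option_args(windows=windows)


if __name__ == "__main__":
    print(" ".join(inchi_command("structure.mol", "structure.txt", windows=False)))
```

The resulting list could be handed to `subprocess.run`, provided an InChI executable is actually installed under that name.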
23. For each Compound (CSID) parent generation is
attempted
“Tautomerism in large databases”, Sitzmann and
others, J. Comput. Aided Mol. Des. (2010)
Parent | Description
Charge-Insensitive | An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear.
Isotope-Insensitive | Isotopes are replaced by the common weight
Stereo-Insensitive | Stereo is stripped
Tautomer-Insensitive | Tautomer canonicalization attempts to generate a “reasonable” tautomer
Super-Insensitive | This parent is all of the above
No fragment-insensitive parent: we treat all fragments as equal entities
25. Chemistry Validation and Standardization Platform (CVSP)
at cvsp.chemspider.com
• Validation
• Standardization
• Parent generation
RDF Export
Data
26. Data is being imported
from ChemSpider to
Open PHACTS in
RDF/turtle
27. RDF/VoID
– VoID is an RDF Schema vocabulary for expressing metadata
about RDF datasets. It is intended as a bridge between the
publishers and users of RDF data. http://www.w3.org/TR/void
• skos:exactMatch (Simple Knowledge Organisation System)
E.g. To link compounds in OPS with compounds in ChEBI.
• skos:closeMatch
E.g. To link Stereo Insensitive Parents to their Children within OPS.
• skos:relatedMatch
E.g. To link Parent compounds that contain others as Fragments.
– Recommendations on how to create the VoID have been specified by
Manchester here: http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/
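The three SKOS mapping predicates listed above could be serialized to Turtle along the following lines. This is a sketch only: all compound URIs are `example.org` placeholders rather than real Open PHACTS or ChEBI identifiers, and the function name is hypothetical. Only the SKOS namespace is the real W3C one.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
MAPPING_PREDICATES = {"exactMatch", "closeMatch", "relatedMatch"}


def links_to_turtle(links):
    """Serialize (subject, predicate, object) link tuples as Turtle text."""
    lines = [f"@prefix skos: <{SKOS_NS}> .", ""]
    for subject, predicate, obj in links:
        if predicate not in MAPPING_PREDICATES:
            raise ValueError(f"unsupported SKOS mapping predicate: {predicate}")
        lines.append(f"<{subject}> skos:{predicate} <{obj}> .")
    return "\n".join(lines) + "\n"


# Placeholder URIs, one link per use case named on the slide.
EXAMPLE_LINKS = [
    # OPS compound linked to the same compound in another dataset (e.g. ChEBI)
    ("http://example.org/ops/compound/1", "exactMatch",
     "http://example.org/chebi/compound/1"),
    # stereo-insensitive parent linked to one of its children within OPS
    ("http://example.org/ops/compound/2", "closeMatch",
     "http://example.org/ops/compound/1"),
    # parent compound linked to a compound it contains as a fragment
    ("http://example.org/ops/compound/3", "relatedMatch",
     "http://example.org/ops/compound/4"),
]

if __name__ == "__main__":
    print(links_to_turtle(EXAMPLE_LINKS))
```

Writing full URIs in angle brackets keeps the sketch short; a real export would more likely declare prefixes for each dataset and describe the linkset itself in VoID.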
30. Future work
Enabling full semantic web capabilities:
• Establishing RDF server with all relationships
(including parent-child relationships)
• Developing SPARQL capability for querying RDF
Validate all records in ChemSpider by passing them
through CVSP