SlideShare une entreprise Scribd logo
1  sur  29
Télécharger pour lire hors ligne
Reading and Writing Molecular File
Formats for Data Exchange of Small
Molecules, Biopolymers and Reactions
Roger Sayle
NextMove Software, Cambridge, UK
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
connection table compression
• One distinguishing feature between computational
chemistry (COMP) and cheminformatics (CINF) is the
representation of hydrogens.
• Typically in organic chemistry, approximately half of
the bonds in a “full” connection table are the bonds
to terminal hydrogen atoms.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
hydrogens in file formats
• Implicit vs. Explicit is also preserved with file I/O.
• SMILES: [H]C([H])([H])[H] vs C (or [CH4]).
• MDL connection tables:
[CH4]
RDKit
1 0 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
M END
$$$$
OpenBabel10021215582D
5 4 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 1 0 0 0 0
1 5 1 0 0 0 0
M END
$$$$
valence models
• To make life interesting, developers of molecule file
formats that support implicit hydrogens (Daylight
and MDL) allow omission of the number of implicit
hydrogens on an atom, to save space, instead relying
on the deriving the number of implicit hydrogens for
the atom’s default valence in a given environment.
• Both formats can fully specify implicit hydrogen
count/valence, but alas not all readers/writers
support these conventions.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
example valence model
• As an example, Daylight’s valence model for SMILES
states that all aromatic nitrogens have no implicit
hydrogens by default.
• If a hydrogen is present, it should be written as
“[nH]”, and if it’s charged as “[nH+]”.
• The number of hydrogens never needs to be
guessed, and the molecular formula is precisely
specified.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
the peril of Valence models
A popular desktop sketching program is responsible for
the poor reaction yields of traditional alchemists.
Saving the above reaction as an MDL RXN file and
reloading it changes the hydrogen count due to bugs in
the software. In practice, such errors lose the
distinction between sodium metal and sodium hydride
(etc.) in pharmaceutical ELNs.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
the mdl/accelrys valence model
• The correct valence is specified by MDL/ISIS (and as
documented in Accelrys’ “Chemical Representation”)
• Neutral carbon should be four valent.
– Everyone agrees on this (hopefully).
• +1 Nitrogen cation should be four valent.
• +1 Carbon cation should be three valent.
• -4 charge Boron should a have valence 1.
– OEChem says 0, OpenBabel says 3, RDKit says 7...
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
the “mdlbench.sdf” benchmark
• To test the fidelity of MDL valence implementations
in available SD file readers, we evaluate the SMILES
generated from a standard reference test file.
• This SD file contains 10,208 connection tables:
– 114 different elements plus “D” and “T”
– 11 charge states (from -4 to +6 inclusive)
– 8 environments (minimum valences from 0 to 7)
– 116*11*8 = 10208
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
recall/precision and subsets
• Rather than assess programs on the 10K theoretically
possible atomic valences, a more realistic benchmark
is to consider the environments encountered in
practice (such as the 199 found in Accrelyrs’ MDDR
database, version 2011.2).
• An alternative restriction is to consider only the 10
“organic” elements (B,C,N,O,P,S,F,Cl,Br,I).
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
24 mdl mol file readers later...
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
Complete Set (10208) Organic Subset (880) MDDR Observed (199)
Toolkit fails invalid dots stars bizarre wrong good percent fraction wrong good percent fraction wrong good percent fraction
OEChem v1.9 0 0 0 264 0 22 9922 97.20% 99.78% 5 875 99.43% 99.43% 0 199 100.00% 100.00%
MDL Direct v8 968 0 0 0 0 22 9218 90.30% 99.76% 5 875 99.43% 99.43% 0 199 100.00% 100.00%
Daylight v4.8.2 0 0 0 880 0 552 8776 85.97% 94.08% 196 684 77.73% 77.73% 1 198 99.50% 99.50%
Indigo 1.1.4 1917 0 0 880 0 184 7227 70.80% 97.52% 79 383 43.52% 82.90% 1 197 98.99% 99.49%
Pipeline Pilot v9.0 0 0 0 968 0 243 8997 88.14% 97.37% 27 853 96.93% 96.93% 3 196 98.49% 98.49%
Dotmatics PinPoint 0 0 0 968 0 650 8590 84.15% 92.97% 46 834 94.77% 94.77% 3 196 98.49% 98.49%
ChemDraw v12 0 704 0 0 0 548 8956 87.74% 94.23% 53 827 93.98% 93.98% 3 196 98.49% 98.49%
SDFilter 1276 0 0 847 0 462 7623 74.68% 94.29% 125 645 73.30% 83.77% 2 195 97.99% 98.98%
InfoChem [internal] 0 0 0 0 0 829 9379 91.88% 91.88% 285 595 67.61% 67.61% 4 195 97.99% 97.99%
Pipeline Pilot v8.5 0 0 0 968 0 2252 6988 68.46% 75.63% 371 509 57.84% 57.84% 5 194 97.49% 97.49%
ChemAxon 5.10 0 0 0 440 0 684 9084 88.99% 93.00% 248 632 71.82% 71.82% 6 193 96.98% 96.98%
CACTVS 3.407 33 0 6 352 2 576 9239 90.51% 94.13% 196 684 77.73% 77.73% 7 192 96.48% 96.48%
CACTVS 3.383 11 792 6 0 2 576 8821 86.41% 93.87% 196 684 77.73% 77.73% 7 192 96.48% 96.48%
Balloon v1.40 0 0 0 0 0 726 9482 92.89% 92.89% 252 628 71.36% 71.36% 8 191 95.98% 95.98%
AvalonTools v1.1b 0 0 0 0 0 732 9476 92.83% 92.83% 270 610 69.32% 69.32% 8 191 95.98% 95.98%
Schrodinger Canvas 616 0 0 0 44 872 8676 84.99% 90.87% 259 621 70.57% 70.57% 11 188 94.47% 94.47%
OpenBabel 2.3.90 0 0 0 176 66 659 9307 91.17% 93.39% 242 638 72.50% 72.50% 12 187 93.97% 93.97%
MOE 2011.10 0 0 0 0 4221 936 5051 49.48% 84.37% 94 340 38.64% 78.34% 5 186 93.47% 97.38%
CDK 1.4.13 [noh] 264 0 0 0 0 485 9459 92.66% 95.12% 165 715 81.25% 81.25% 14 185 92.96% 92.96%
RDKit +NextMove 2339 0 66 0 0 1143 6660 65.24% 85.35% 285 222 25.23% 43.79% 9 182 91.46% 95.29%
InChI=1S/... 880 0 6293 0 0 141 2894 28.35% 95.35% 85 795 90.34% 90.34% 11 113 56.78% 91.13%
CDK 1.4.13 [imph] 9876 0 0 0 0 80 252 2.47% 75.90% 30 63 7.16% 67.74% 12 98 49.25% 89.09%
RDKit 2012_09 4029 0 66 0 0 4722 1391 13.63% 22.75% 285 222 25.23% 43.79% 100 88 44.22% 46.81%
InfoChem +RDKit 3395 0 66 634 0 4735 1378 13.50% 22.54% 285 222 25.23% 43.79% 100 88 44.22% 46.81%
mdlbench.sdf results @ RDKIT UGM
ToolKit Failures Incorrect Correct Recall Precision
OEChem v1.9 264 22 9922 97.20% 99.78%
CDK v1.4.13 264 486 9458 95.11% 95.11%
OpenBabel v2.3.90 176 668 9364 91.73% 93.34%
CACTVS v3.407 352 511 9339 91.49% 94.81%
MDL Direct v8.0 968 22 9218 90.30% 99.76%
ChemAxon v5.10 440 685 9083 88.98% 92.99%
Pipeline Pilot v9.0 968 243 8997 88.14% 97.37%
ChemDraw v12.0 704 548 8956 87.74% 94.23%
GGA Indigo v1.1.4 2797 184 7227 70.80% 97.52%
MOE v2011.10 4221 936 5051 49.48% 84.37%
RDKit v2012_09 4095 4723 1390 13.62% 22.74%
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
mdlbench.sdf summary
• Although many toolkits can read MDL SD files, they
don’t all perfectly agree on the semantics.
• Different readers, sometimes written by the same
company, can interpret the exact same MOL file as
different molecules.
• Fortunately, most compounds encountered in the
pharmaceutical industry fall into the widely
understood “well-behaved” subset.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
mdlbench.sdf postscript
• Following this experiment, feedback was passed back
to several vendors/developers and the benchmark
including the expected results were published online.
• Patches were submitted to RDKit and OpenBabel to
achieve 100% conformance on this benchmark.
• Many vendors, including ChemAxon, Dotmatics,
Optibrium and Xemistry, and open source projects,
including MayaChem and Balloon, have announced
improvements based on this work.
• As a result the community converges on a standard.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
part 2: further support
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
label preservation (MDL ALIAS)
Can be converted to a molfile as
NextMove07171309292D
5 4 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0
1.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.5000 -0.8660 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.5000 0.8660 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
2.5000 0.8660 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 2 0 0 0 0
2 4 1 0 0 0 0
4 5 1 0 0 0 0
A 1
Alexa555
A 5
LH-RH
M END
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
v3000 mol files
• Accelrys’ introduction of the version 3000 molfile
introduces a chicken and egg problem for the
cheminformatics community.
• Adoption of the format is hampered by the limited
support for v3000 molfiles amongst third-party tools,
but equally the v3000 files in circulation create
problems for developers.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
reactions as 1st class citizens
• Although all toolkits support small molecule
exchange, support for reactions is generally poorer.
• This situation is compounded by treating reactions as
different objects/types than regular molecules, and
sometimes introducing format distinctions based
upon the content of a file.
• Many formats, including MDL mol file (via CPSS), ISIS
Sketch, SMILES, ChemDraw CDX and CDXML merrily
encode both molecules and reactions.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
support for empty reactions
• An interesting class of informatics problems concerns
molecules with no atoms, and reactions with no
reactants and/or products.
• In SDF and RDF file formats, these records can have
associated data, but it is not uncommon for this to
be skipped/lost by many tools.
• A SMILES example might be “>> title”.
• It is not uncommon for chemists to draw nothing
after an arrow to indicate a reaction failure, or
nothing before one to indicate compound purchase.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
agents, solvents and catalysts
An extremely useful extension to the MDL RXN file
format introduced by ChemAxon is support for agents.
After the reactant count and product count fields,
allowing an agent count avoid information loss,
capturing items drawn above/below a reaction arrow.
CSc1ccccc1(F)>OO>CS(=O)c1ccc(cc1)F
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
sketch file support
• The community should tackle better handling of
sketch file formats, such as ISIS Sketch, CDX, CDXML,
Marvin and ACD/Labs sk2.
• The challenge is that these formats are intended to
be consumed by human beings not computers, but
they do capture details otherwise lost in translation.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
decrypting cdx & cDXML files
• CambridgeSoft are to be congratulated for publicly
documenting their CDX and CDXML file formats.
• Unfortunately this online ChemDraw developer
resource is no longer being kept up to date.
• Mistakes: “arrow” is encoded by object 0x8021.
• New tags: object 0x802b encodes “annotation”.
• Proprietary property tags: USPTO’s “PageDefinition”.
• Support for reading and writing isotopic information
in CDXML files has been contributed to Open Babel.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
superatom expansion
becomes
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
chemdraw superatom correction
Chemist drew Chemist Intended Chemist got
DMF
K2CO3
HBTU
mCPBA
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
salt/component grouping
• Keeping track of the intended number and formulae
of reactants, products and agents in a reaction,
requires preserving salt form associations.
• This can be implemented by honoring the “group”
information from the sketch as single disconnected
components in MDL RXN and RD file output.
• These associations are traditionally lost in SMILES...
...>CC(=O)[O-].[Cl-].[Cl-].[Fe].[K+].[Pd+2]...>...
but can be retained via ChemAxon/GGA extensions
to SMILES that appear as “[Na+].[Cl-] |f:0.1| salt”.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
zero order bonds
• Clark[1] has proposed a useful extension to MDL file
format for representing bonds of order zero, “M
ZBO” which solve a number representation issues in
organometallics.
• In this presentation, we propose a novel extension to
SMILES to encode/preserve these: [K]..OC(=O)O..[K]
1. Alex M. Clark, “Accurate Specification of Molecular Structures: The Case for Zero-Order
Bonds and Explicit Hydrogen Counting”, J. Chem. Inf. Model. 51:3149-3157, 2011.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
biomolecules as MDL sgroups
• Peptides, nucleic acids and sugars can be annotated
in a number of ways in molfiles, typically as
abbreviated Sgroups that define leaving groups.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
PEPTIDE1{C.Y.I.Q.N.C.P.L.G.[am]}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$
variable attachment points
• Reported oligosaccharide structures often contain
linkages whose exact attachment point is unknown.
• Glc(β1→?)Glc has four possible structures, including:
• Such structures can be round-tripped between IUPAC
line notations, SMILES and MDL v2000 and v3000
molfiles (sometimes using common extensions).
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
variable attachment points
• Conveniently, these formats use the same “dummy
atom” semantics to represent a set of atoms.
• SMILES does not support positional variation, but
ChemAxon use an extension: CCCl.Cl* |m:4:1.0|
• V3000 molfiles from both Accelrys Draw and Marvin
store variable attachments in the bond block
M V30 7 1 8 7 ENDPTS=(3 3 2 1) ATTACH=ANY
• V2000 molfiles from ChemAxon Marvin and ACDLabs
ChemSketch use their own incompatible extensions.
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
acknowledgements
• Plamen Petrov, AstraZeneca, Molndal, Sweden.
• Sorel Muresan, AkzoNobel, Stenungsund, Sweden.
• Richard Hall, Astex Pharmaceuticals, Cambridge, UK.
• Markus Sitzmann, NCI, Washington DC, USA.
• Wolf-Dietrich Ihlenfeldt, Xemistry GmbH, Germany.
• Evan Bolton, PubChem, NCBI, Washington DC, USA.
• Daniel Lowe and Noel O’Boyle, NextMove Software,
Cambridge, UK.
• Thank you for your time. Questions?
246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013

Contenu connexe

Tendances

RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]NextMove Software
 
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...NextMove Software
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Chemical structure representation in PubChem
Chemical structure representation in PubChemChemical structure representation in PubChem
Chemical structure representation in PubChemNextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Automated spelling correction to improve recall rates of name-to-structure to...
Automated spelling correction to improve recall rates of name-to-structure to...Automated spelling correction to improve recall rates of name-to-structure to...
Automated spelling correction to improve recall rates of name-to-structure to...Sorel Muresan
 
Getting the Big Picture by Joining up the SAR dots
Getting the Big Picture by Joining up the SAR dotsGetting the Big Picture by Joining up the SAR dots
Getting the Big Picture by Joining up the SAR dotsSorel Muresan
 
Patent Cheminformatics: Identification of key compounds in patents
Patent Cheminformatics: Identification of key compounds in patentsPatent Cheminformatics: Identification of key compounds in patents
Patent Cheminformatics: Identification of key compounds in patentsSorel Muresan
 
RDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsRDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsNextMove Software
 
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...Kamel Mansouri
 
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...Hitesh Patel
 
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACSExtracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACSSimBioSys_Inc
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Kamel Mansouri
 
ACS 2012 Flashtalk
ACS 2012 FlashtalkACS 2012 Flashtalk
ACS 2012 Flashtalkmcbridemj
 
Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Kamel Mansouri
 
Preparation of pyrimido[4,5 b][1,6]naphthyridin-4(1 h)-one derivatives
Preparation of pyrimido[4,5 b][1,6]naphthyridin-4(1 h)-one derivativesPreparation of pyrimido[4,5 b][1,6]naphthyridin-4(1 h)-one derivatives
Preparation of pyrimido[4,5 b][1,6]naphthyridin-4(1 h)-one derivativeselshimaa eid
 

Tendances (20)

RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
 
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
CHAS 31: Encoding reactive chemical hazards and incompatibilities in an alert...
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Chemical structure representation in PubChem
Chemical structure representation in PubChemChemical structure representation in PubChem
Chemical structure representation in PubChem
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Automated spelling correction to improve recall rates of name-to-structure to...
Automated spelling correction to improve recall rates of name-to-structure to...Automated spelling correction to improve recall rates of name-to-structure to...
Automated spelling correction to improve recall rates of name-to-structure to...
 
Getting the Big Picture by Joining up the SAR dots
Getting the Big Picture by Joining up the SAR dotsGetting the Big Picture by Joining up the SAR dots
Getting the Big Picture by Joining up the SAR dots
 
Patent Cheminformatics: Identification of key compounds in patents
Patent Cheminformatics: Identification of key compounds in patentsPatent Cheminformatics: Identification of key compounds in patents
Patent Cheminformatics: Identification of key compounds in patents
 
RDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical DepictionsRDKit UGM 2016: Higher Quality Chemical Depictions
RDKit UGM 2016: Higher Quality Chemical Depictions
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patents
 
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
 
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
Synthetically Accessible Virtual Inventory (SAVI) : Reaction generation and h...
 
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACSExtracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS
Extracting Synthetic Knowledge from Reaction Databases - ARChem at the 246th ACS
 
Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Comm...
Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Comm...Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Comm...
Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Comm...
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...
 
Ranjini ppr
Ranjini pprRanjini ppr
Ranjini ppr
 
ACS 2012 Flashtalk
ACS 2012 FlashtalkACS 2012 Flashtalk
ACS 2012 Flashtalk
 
Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...
 
Preparation of pyrimido[4,5 b][1,6]naphthyridin-4(1 h)-one derivatives
Preparation of pyrimido[4,5 b][1,6]naphthyridin-4(1 h)-one derivativesPreparation of pyrimido[4,5 b][1,6]naphthyridin-4(1 h)-one derivatives
Preparation of pyrimido[4,5 b][1,6]naphthyridin-4(1 h)-one derivatives
 

Similaire à Reading and Writing Molecular File Formats for Data Exchange of Small Molecules, Biopolymers and Reactions

Cheminformatics toolkits: a personal perspective
Cheminformatics toolkits: a personal perspectiveCheminformatics toolkits: a personal perspective
Cheminformatics toolkits: a personal perspectiveNextMove Software
 
IEA-GHG project and findings to date
IEA-GHG project and findings to dateIEA-GHG project and findings to date
IEA-GHG project and findings to dateIEA-ETSAP
 
Chemoinformatics in Action
Chemoinformatics in ActionChemoinformatics in Action
Chemoinformatics in ActionSSA KPI
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Modelon JSME 2016 - Model Based Design for Fuel Cell Systems
Modelon JSME 2016 - Model Based Design for Fuel Cell SystemsModelon JSME 2016 - Model Based Design for Fuel Cell Systems
Modelon JSME 2016 - Model Based Design for Fuel Cell SystemsModelon
 
EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 ...
EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 ...EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 ...
EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 ...ChemAxon
 
OPTIMIZING BATTERY END-OF-LIFE BY STATIONARY APPLICATIONS AND RECYCLING
OPTIMIZING BATTERY END-OF-LIFE BY STATIONARY APPLICATIONS AND RECYCLINGOPTIMIZING BATTERY END-OF-LIFE BY STATIONARY APPLICATIONS AND RECYCLING
OPTIMIZING BATTERY END-OF-LIFE BY STATIONARY APPLICATIONS AND RECYCLINGiQHub
 
Launching digital biology, 12 May 2015, Bremen
Launching digital biology, 12 May 2015, BremenLaunching digital biology, 12 May 2015, Bremen
Launching digital biology, 12 May 2015, Bremenbioflux
 
MVAPICH: How a Bunch of Buckeyes Crack Tough Nuts
MVAPICH: How a Bunch of Buckeyes Crack Tough NutsMVAPICH: How a Bunch of Buckeyes Crack Tough Nuts
MVAPICH: How a Bunch of Buckeyes Crack Tough Nutsinside-BigData.com
 
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...CARLOS III UNIVERSITY OF MADRID
 
Investigation of Metal and Chemical Hydrides for Hydrogen Storage in Novel Fu...
Investigation of Metal and Chemical Hydrides for Hydrogen Storage in Novel Fu...Investigation of Metal and Chemical Hydrides for Hydrogen Storage in Novel Fu...
Investigation of Metal and Chemical Hydrides for Hydrogen Storage in Novel Fu...chrisrobschu
 
OSLC KM: Elevating the meaning of data and operations within the toolchain
OSLC KM: Elevating the meaning of data and operations within the toolchainOSLC KM: Elevating the meaning of data and operations within the toolchain
OSLC KM: Elevating the meaning of data and operations within the toolchainCARLOS III UNIVERSITY OF MADRID
 
survey_of_matrix_for_simulation
survey_of_matrix_for_simulationsurvey_of_matrix_for_simulation
survey_of_matrix_for_simulationJon Hand
 
CONCEPTUAL DESIGN OF CHEMICAL PROCESSES.pdf
CONCEPTUAL DESIGN OF CHEMICAL PROCESSES.pdfCONCEPTUAL DESIGN OF CHEMICAL PROCESSES.pdf
CONCEPTUAL DESIGN OF CHEMICAL PROCESSES.pdfjohn917616
 
basic-engineering-circuit-analysis-10th-Irwin.pdf
basic-engineering-circuit-analysis-10th-Irwin.pdfbasic-engineering-circuit-analysis-10th-Irwin.pdf
basic-engineering-circuit-analysis-10th-Irwin.pdfAngelGabrielParianGa1
 
Implementing Life Cycle Assessment to Understand the Environmental and Energy...
Implementing Life Cycle Assessment to Understand the Environmental and Energy...Implementing Life Cycle Assessment to Understand the Environmental and Energy...
Implementing Life Cycle Assessment to Understand the Environmental and Energy...barkercarlock
 

Similaire à Reading and Writing Molecular File Formats for Data Exchange of Small Molecules, Biopolymers and Reactions (20)

Cheminformatics toolkits: a personal perspective
Cheminformatics toolkits: a personal perspectiveCheminformatics toolkits: a personal perspective
Cheminformatics toolkits: a personal perspective
 
Reactive distillation
Reactive distillationReactive distillation
Reactive distillation
 
IEA-GHG project and findings to date
IEA-GHG project and findings to dateIEA-GHG project and findings to date
IEA-GHG project and findings to date
 
Chemoinformatics in Action
Chemoinformatics in ActionChemoinformatics in Action
Chemoinformatics in Action
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Modelon JSME 2016 - Model Based Design for Fuel Cell Systems
Modelon JSME 2016 - Model Based Design for Fuel Cell SystemsModelon JSME 2016 - Model Based Design for Fuel Cell Systems
Modelon JSME 2016 - Model Based Design for Fuel Cell Systems
 
Hd other
Hd otherHd other
Hd other
 
EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 ...
EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 ...EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 ...
EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 ...
 
OPTIMIZING BATTERY END-OF-LIFE BY STATIONARY APPLICATIONS AND RECYCLING
OPTIMIZING BATTERY END-OF-LIFE BY STATIONARY APPLICATIONS AND RECYCLINGOPTIMIZING BATTERY END-OF-LIFE BY STATIONARY APPLICATIONS AND RECYCLING
OPTIMIZING BATTERY END-OF-LIFE BY STATIONARY APPLICATIONS AND RECYCLING
 
Manual petsc
Manual petscManual petsc
Manual petsc
 
Launching digital biology, 12 May 2015, Bremen
Launching digital biology, 12 May 2015, BremenLaunching digital biology, 12 May 2015, Bremen
Launching digital biology, 12 May 2015, Bremen
 
MVAPICH: How a Bunch of Buckeyes Crack Tough Nuts
MVAPICH: How a Bunch of Buckeyes Crack Tough NutsMVAPICH: How a Bunch of Buckeyes Crack Tough Nuts
MVAPICH: How a Bunch of Buckeyes Crack Tough Nuts
 
Advanced Computational Drug Design
Advanced Computational Drug DesignAdvanced Computational Drug Design
Advanced Computational Drug Design
 
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
 
Investigation of Metal and Chemical Hydrides for Hydrogen Storage in Novel Fu...
Investigation of Metal and Chemical Hydrides for Hydrogen Storage in Novel Fu...Investigation of Metal and Chemical Hydrides for Hydrogen Storage in Novel Fu...
Investigation of Metal and Chemical Hydrides for Hydrogen Storage in Novel Fu...
 
OSLC KM: Elevating the meaning of data and operations within the toolchain
OSLC KM: Elevating the meaning of data and operations within the toolchainOSLC KM: Elevating the meaning of data and operations within the toolchain
OSLC KM: Elevating the meaning of data and operations within the toolchain
 
survey_of_matrix_for_simulation
survey_of_matrix_for_simulationsurvey_of_matrix_for_simulation
survey_of_matrix_for_simulation
 
CONCEPTUAL DESIGN OF CHEMICAL PROCESSES.pdf
CONCEPTUAL DESIGN OF CHEMICAL PROCESSES.pdfCONCEPTUAL DESIGN OF CHEMICAL PROCESSES.pdf
CONCEPTUAL DESIGN OF CHEMICAL PROCESSES.pdf
 
basic-engineering-circuit-analysis-10th-Irwin.pdf
basic-engineering-circuit-analysis-10th-Irwin.pdfbasic-engineering-circuit-analysis-10th-Irwin.pdf
basic-engineering-circuit-analysis-10th-Irwin.pdf
 
Implementing Life Cycle Assessment to Understand the Environmental and Energy...
Implementing Life Cycle Assessment to Understand the Environmental and Energy...Implementing Life Cycle Assessment to Understand the Environmental and Energy...
Implementing Life Cycle Assessment to Understand the Environmental and Energy...
 

Plus de NextMove Software

Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeNextMove Software
 
GHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulGHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulNextMove Software
 
Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)NextMove Software
 
Sketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sightSketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sightNextMove Software
 
Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?NextMove Software
 
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...NextMove Software
 

Plus de NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 
GHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulGHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be useful
 
Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)
 
Sketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sightSketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sight
 
Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?
 
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
 

Dernier

How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxcallscotland1987
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 

Dernier (20)

How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 

Reading and Writing Molecular File Formats for Data Exchange of Small Molecules, Biopolymers and Reactions

  • 1. Reading and Writing Molecular File Formats for Data Exchange of Small Molecules, Biopolymers and Reactions Roger Sayle NextMove Software, Cambridge, UK 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 2. connection table compression • One distinguishing feature between computational chemistry (COMP) and cheminformatics (CINF) is the representation of hydrogens. • Typically in organic chemistry, approximately half of the bonds in a “full” connection table are the bonds to terminal hydrogen atoms. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 3. hydrogens in file formats • Implicit vs. Explicit is also preserved with file I/O. • SMILES: [H]C([H])([H])[H] vs C (or [CH4]). • MDL connection tables: [CH4] RDKit 1 0 0 0 0 0 0 0 0 0999 V2000 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 M END $$$$ OpenBabel10021215582D 5 4 0 0 0 0 0 0 0 0999 V2000 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0 0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 1 3 1 0 0 0 0 1 4 1 0 0 0 0 1 5 1 0 0 0 0 M END $$$$
  • 4. valence models • To make life interesting, developers of molecule file formats that support implicit hydrogens (Daylight and MDL) allow omission of the number of implicit hydrogens on an atom, to save space, instead relying on the deriving the number of implicit hydrogens for the atom’s default valence in a given environment. • Both formats can fully specify implicit hydrogen count/valence, but alas not all readers/writers support these conventions. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 5. example valence model • As an example, Daylight’s valence model for SMILES states that all aromatic nitrogens have no implicit hydrogens by default. • If a hydrogen is present, it should be written as “[nH]”, and if it’s charged as “[nH+]”. • The number of hydrogens never needs to be guessed, and the molecular formula is precisely specified. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 6. the peril of Valence models A popular desktop sketching program is responsible for the poor reaction yields of traditional alchemists. Saving the above reaction as an MDL RXN file and reloading it changes the hydrogen count due to bugs in the software. In practice, such errors lose the distinction between sodium metal and sodium hydride (etc.) in pharmaceutical ELNs. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 7. the mdl/accelrys valence model • The correct valence is specified by MDL/ISIS (and as documented in Accelrys’ “Chemical Representation”) • Neutral carbon should be four valent. – Everyone agrees on this (hopefully). • +1 Nitrogen cation should be four valent. • +1 Carbon cation should be three valent. • -4 charge Boron should a have valence 1. – OEChem says 0, OpenBabel says 3, RDKit says 7... 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 8. the “mdlbench.sdf” benchmark • To test the fidelity of MDL valence implementations in available SD file readers, we evaluate the SMILES generated from a standard reference test file. • This SD file contains 10,208 connection tables: – 114 different elements plus “D” and “T” – 11 charge states (from -4 to +6 inclusive) – 8 environments (minimum valences from 0 to 7) – 116*11*8 = 10208 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 9. recall/precision and subsets • Rather than assess programs on the 10K theoretically possible atomic valences, a more realistic benchmark is to consider the environments encountered in practice (such as the 199 found in Accrelyrs’ MDDR database, version 2011.2). • An alternative restriction is to consider only the 10 “organic” elements (B,C,N,O,P,S,F,Cl,Br,I). 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 10. 24 mdl mol file readers later... 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013 Complete Set (10208) Organic Subset (880) MDDR Observed (199) Toolkit fails invalid dots stars bizarre wrong good percent fraction wrong good percent fraction wrong good percent fraction OEChem v1.9 0 0 0 264 0 22 9922 97.20% 99.78% 5 875 99.43% 99.43% 0 199 100.00% 100.00% MDL Direct v8 968 0 0 0 0 22 9218 90.30% 99.76% 5 875 99.43% 99.43% 0 199 100.00% 100.00% Daylight v4.8.2 0 0 0 880 0 552 8776 85.97% 94.08% 196 684 77.73% 77.73% 1 198 99.50% 99.50% Indigo 1.1.4 1917 0 0 880 0 184 7227 70.80% 97.52% 79 383 43.52% 82.90% 1 197 98.99% 99.49% Pipeline Pilot v9.0 0 0 0 968 0 243 8997 88.14% 97.37% 27 853 96.93% 96.93% 3 196 98.49% 98.49% Dotmatics PinPoint 0 0 0 968 0 650 8590 84.15% 92.97% 46 834 94.77% 94.77% 3 196 98.49% 98.49% ChemDraw v12 0 704 0 0 0 548 8956 87.74% 94.23% 53 827 93.98% 93.98% 3 196 98.49% 98.49% SDFilter 1276 0 0 847 0 462 7623 74.68% 94.29% 125 645 73.30% 83.77% 2 195 97.99% 98.98% InfoChem [internal] 0 0 0 0 0 829 9379 91.88% 91.88% 285 595 67.61% 67.61% 4 195 97.99% 97.99% Pipeline Pilot v8.5 0 0 0 968 0 2252 6988 68.46% 75.63% 371 509 57.84% 57.84% 5 194 97.49% 97.49% ChemAxon 5.10 0 0 0 440 0 684 9084 88.99% 93.00% 248 632 71.82% 71.82% 6 193 96.98% 96.98% CACTVS 3.407 33 0 6 352 2 576 9239 90.51% 94.13% 196 684 77.73% 77.73% 7 192 96.48% 96.48% CACTVS 3.383 11 792 6 0 2 576 8821 86.41% 93.87% 196 684 77.73% 77.73% 7 192 96.48% 96.48% Balloon v1.40 0 0 0 0 0 726 9482 92.89% 92.89% 252 628 71.36% 71.36% 8 191 95.98% 95.98% AvalonTools v1.1b 0 0 0 0 0 732 9476 92.83% 92.83% 270 610 69.32% 69.32% 8 191 95.98% 95.98% Schrodinger Canvas 616 0 0 0 44 872 8676 84.99% 90.87% 259 621 70.57% 70.57% 11 188 94.47% 94.47% OpenBabel 2.3.90 0 0 0 176 66 659 9307 91.17% 93.39% 242 638 72.50% 72.50% 12 187 93.97% 93.97% MOE 2011.10 0 0 0 0 4221 936 5051 49.48% 84.37% 94 340 38.64% 78.34% 5 186 93.47% 97.38% CDK 1.4.13 [noh] 264 0 0 0 0 485 9459 92.66% 95.12% 165 715 81.25% 81.25% 14 185 92.96% 92.96% RDKit +NextMove 2339 0 66 0 0 1143 6660 65.24% 85.35% 285 222 25.23% 43.79% 9 182 91.46% 95.29% InChI=1S/... 880 0 6293 0 0 141 2894 28.35% 95.35% 85 795 90.34% 90.34% 11 113 56.78% 91.13% CDK 1.4.13 [imph] 9876 0 0 0 0 80 252 2.47% 75.90% 30 63 7.16% 67.74% 12 98 49.25% 89.09% RDKit 2012_09 4029 0 66 0 0 4722 1391 13.63% 22.75% 285 222 25.23% 43.79% 100 88 44.22% 46.81% InfoChem +RDKit 3395 0 66 634 0 4735 1378 13.50% 22.54% 285 222 25.23% 43.79% 100 88 44.22% 46.81%
  • 11. mdlbench.sdf results @ RDKIT UGM ToolKit Failures Incorrect Correct Recall Precision OEChem v1.9 264 22 9922 97.20% 99.78% CDK v1.4.13 264 486 9458 95.11% 95.11% OpenBabel v2.3.90 176 668 9364 91.73% 93.34% CACTVS v3.407 352 511 9339 91.49% 94.81% MDL Direct v8.0 968 22 9218 90.30% 99.76% ChemAxon v5.10 440 685 9083 88.98% 92.99% Pipeline Pilot v9.0 968 243 8997 88.14% 97.37% ChemDraw v12.0 704 548 8956 87.74% 94.23% GGA Indigo v1.1.4 2797 184 7227 70.80% 97.52% MOE v2011.10 4221 936 5051 49.48% 84.37% RDKit v2012_09 4095 4723 1390 13.62% 22.74% 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 12. mdlbench.sdf summary • Although many toolkits can read MDL SD files, they don’t all perfectly agree on the semantics. • Different readers, sometimes written by the same company, can interpret the exact same MOL file as different molecules. • Fortunately, most compounds encountered in the pharmaceutical industry fall into the widely understood “well-behaved” subset. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 13. mdlbench.sdf postscript • Following this experiment, feedback was passed back to several vendors/developers and the benchmark including the expected results were published online. • Patches were submitted to RDKit and OpenBabel to achieve 100% conformance on this benchmark. • Many vendors, including ChemAxon, Dotmatics, Optibrium and Xemistry, and open source projects, including MayaChem and Balloon, have announced improvements based on this work. • As a result the community converges on a standard. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 14. part 2: further support 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 15. label preservation (MDL ALIAS) Can be converted to a molfile as NextMove07171309292D 5 4 0 0 0 0 0 0 0999 V2000 0.0000 0.0000 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0 1.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 1.5000 -0.8660 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 1.5000 0.8660 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 2.5000 0.8660 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 2 3 2 0 0 0 0 2 4 1 0 0 0 0 4 5 1 0 0 0 0 A 1 Alexa555 A 5 LH-RH M END 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 16. v3000 mol files • Accelrys’ introduction of the version 3000 molfile introduces a chicken and egg problem for the cheminformatics community. • Adoption of the format is hampered by the limited support for v3000 molfiles amongst third-party tools, but equally the v3000 files in circulation create problems for developers. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 17. reactions as 1st class citizens • Although all toolkits support small molecule exchange, support for reactions is generally poorer. • This situation is compounded by treating reactions as different objects/types than regular molecules, and sometimes introducing format distinctions based upon the content of a file. • Many formats, including MDL mol file (via CPSS), ISIS Sketch, SMILES, ChemDraw CDX and CDXML merrily encode both molecules and reactions. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 18. support for empty reactions • An interesting class of informatics problems concerns molecules with no atoms, and reactions with no reactants and/or products. • In SDF and RDF file formats, these records can have associated data, but it is not uncommon for this to be skipped/lost by many tools. • A SMILES example might be “>> title”. • It is not uncommon for chemists to draw nothing after an arrow to indicate a reaction failure, or nothing before one to indicate compound purchase. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 19. agents, solvents and catalysts An extremely useful extension to the MDL RXN file format introduced by ChemAxon is support for agents. After the reactant count and product count fields, allowing an agent count avoid information loss, capturing items drawn above/below a reaction arrow. CSc1ccccc1(F)>OO>CS(=O)c1ccc(cc1)F 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 20. sketch file support • The community should tackle better handling of sketch file formats, such as ISIS Sketch, CDX, CDXML, Marvin and ACD/Labs sk2. • The challenge is that these formats are intended to be consumed by human beings not computers, but they do capture details otherwise lost in translation. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 21. decrypting cdx & cDXML files • CambridgeSoft are to be congratulated for publicly documenting their CDX and CDXML file formats. • Unfortunately this online ChemDraw developer resource is no longer being kept up to date. • Mistakes: “arrow” is encoded by object 0x8021. • New tags: object 0x802b encodes “annotation”. • Proprietary property tags: USPTO’s “PageDefinition”. • Support for reading and writing isotopic information in CDXML files has been contributed to Open Babel. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 22. superatom expansion becomes 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 23. chemdraw superatom correction Chemist drew Chemist Intended Chemist got DMF K2CO3 HBTU mCPBA 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 24. salt/component grouping • Keeping track of the intended number and formulae of reactants, products and agents in a reaction, requires preserving salt form associations. • This can be implemented by honoring the “group” information from the sketch as single disconnected components in MDL RXN and RD file output. • These associations are traditionally lost in SMILES... ...>CC(=O)[O-].[Cl-].[Cl-].[Fe].[K+].[Pd+2]...>... but can be retained via ChemAxon/GGA extensions to SMILES that appear as “[Na+].[Cl-] |f:0.1| salt”. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 25. zero order bonds • Clark[1] has proposed a useful extension to MDL file format for representing bonds of order zero, “M ZBO” which solve a number representation issues in organometallics. • In this presentation, we propose a novel extension to SMILES to encode/preserve these: [K]..OC(=O)O..[K] 1. Alex M. Clark, “Accurate Specification of Molecular Structures: The Case for Zero-Order Bonds and Explicit Hydrogen Counting”, J. Chem. Inf. Model. 51:3149-3157, 2011. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 26. biomolecules as MDL sgroups • Peptides, nucleic acids and sugars can be annotated in a number of ways in molfiles, typically as abbreviated Sgroups that define leaving groups. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013 PEPTIDE1{C.Y.I.Q.N.C.P.L.G.[am]}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$
  • 27. variable attachment points • Reported oligosaccharide structures often contain linkages whose exact attachment point is unknown. • Glc(β1→?)Glc has four possible structures, including: • Such structures can be round-tripped between IUPAC line notations, SMILES and MDL v2000 and v3000 molfiles (sometimes using common extensions). 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 28. variable attachment points • Conveniently, these formats use the same “dummy atom” semantics to represent a set of atoms. • SMILES does not support positional variation, but ChemAxon use an extension: CCCl.Cl* |m:4:1.0| • V3000 molfiles from both Accelrys Draw and Marvin store variable attachments in the bond block M V30 7 1 8 7 ENDPTS=(3 3 2 1) ATTACH=ANY • V2000 molfiles from ChemAxon Marvin and ACDLabs ChemSketch use their own incompatible extensions. 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013
  • 29. acknowledgements • Plamen Petrov, AstraZeneca, Molndal, Sweden. • Sorel Muresan, AkzoNobel, Stenungsund, Sweden. • Richard Hall, Astex Pharmaceuticals, Cambridge, UK. • Markus Sitzmann, NCI, Washington DC, USA. • Wolf-Dietrich Ihlenfeldt, Xemistry GmbH, Germany. • Evan Bolton, PubChem, NCBI, Washington DC, USA. • Daniel Lowe and Noel O’Boyle, NextMove Software, Cambridge, UK. • Thank you for your time. Questions? 246th ACS National Meeting, Indianapolis, IN, Wednesday 11th September 2013