Modelling Proteins By Computational Structural Biology
Poster wellcome-trust-2013-herhaus-fuzzy-mcs
1. Improving Cluster Representations by "Fuzzifying"
Maximum Common Substructures
Christian Herhaus
Merck KGaA, Computational Chemistry, Frankfurter Str. 250, 64293 Darmstadt, christian.herhaus@merckgroup.com
Introduction
Results
Arranging similar structures in clusters is one of the typical tasks of modern
Chemoinformatics with high impact in HTS follow-up, generation of structure activity
relationships (SAR) and selection of starting points for compound optimization.
The approach presented here allows the generation of MCS also for similarity-based
clusters with a given inherent structural diversity. It does so by utilizing query features like
atoms lists and query bonds to add "fuzziness" in atom and bond information to the MCS.
Methods for cluster generation are as diverse as the structures which they are applied to
[1], may they be e.g. similarity- or substructure-based. Typically, medicinal chemists tend
to orientate themselves in structure subsets like clusters with the help of substructures,
so-called "scaffolds", which intuitively characterize the structural relationships between
the molecules of the subset.
As a result, the generated "fuzzy" MCS, although still being fully database-searchable, is
more meaningful for the characterisation of clusters as it can cover larger parts of the full
structures than a conventional MCS could do. The approach was implemented in Pipeline
Pilot™ for proof of concept but is general enough to be transferred to other technical
platforms as well.
In the case of substructure-based clustering, well established methods are existing for
generation of Maximum Common Substructures (MCS) which are present in all members
of the structure population or a defined proportion thereof [2]. But in the case of similaritybased clusters, such MCS may either not be existing for the required dataset proportion
or the common substructure may be so small that it is no longer representative and
therefore lacking information.
Conventional
MCS
Aim of this approach is to provide descriptive scaffold structures also for clusters whose
intrinsic structural diversity is too high to be characterized by conventional Maximum
Common Substructure approaches.
"Fuzzy"
MCS
Methods
any
s/d
s/d
s/d
Content of the Pipeline Pilot™ component "Generate Fuzzy MCS" (slightly simplified)
1. Reduced graphs
2. Generalized MCS
any
any
any
any
any
any any
any
any
any any
any
any any
any
any any
any any
any
any
any
any
any
any
any
any any
any
any
6. Query bonds coded as
stereo bonds
7. Final "fuzzy" MCS
any
s/d
s/d
3. Generalized MCS mapped onto structures
4. Structures reduced & mappings normalized
6
1
1
1
17
11
11
11
9
7
1
13
11
16
5. Collected atom/bond information transfered into queries
Atom No
1
7
11
Elements
O,O,S
N,N,O
C,C,C
Atom No
1
7
11
Element
[O,S]
[N,O]
C
Bond No
3
9
14
Types
1,1,1
1,2,1
2,1,4
Bond No
3
9
14
Type
single
single/double
aromatic
7
7
20
s/d
7
Where to get / How to contribute
The current version of the component is published on the Accelrys® Pipeline
Pilot™ user forum and can be downloaded from
https://community.accelrys.com/message/16528.
Contributors are appreciated for improving the approach and for transfering
the concept to other technology platforms like RDKit, CDK or KNIME.
Current limitations
• Aromaticity may cause unexpected effects (e.g. single/double vs. aromatic bonds)
• Symmetric substructures may blur atom/bond information (e.g. carboxyl or nitro groups)
• Currently, just one presence of an atom/bond feature leads to inclusion into a query
feature. For larger clusters, a percentual threshold may be required
• Different ring sizes and chain lengths can so far not be described by query patterns.
[1] Downs GM, Barnard JM: Clustering Methods and Their Uses in Computational Chemistry. In Reviews in Computational Chemistry (18). Chichester: Wiley and Sons; 2003, 1-40.
[2] Ehrlich HC, Rarey M: Maximum common subgraph isomorphism algorithms and their applications in molecular science: a review. WIREs Comput Mol Sci 2011, 1:68-79.