Improving Cluster Representations by "Fuzzifying"
Maximum Common Substructures
Christian Herhaus
Merck KGaA, Computational Chemistry, Frankfurter Str. 250, 64293 Darmstadt, christian.herhaus@merckgroup.com

Introduction

Arranging similar structures in clusters is a typical task in modern chemoinformatics,
with high impact on HTS follow-up, the generation of structure-activity relationships
(SAR), and the selection of starting points for compound optimization.

Methods for cluster generation are as diverse as the structures to which they are applied
[1]; they may be, for example, similarity- or substructure-based. Medicinal chemists
typically orient themselves within structure subsets such as clusters with the help of
substructures, so-called "scaffolds", which intuitively characterize the structural
relationships between the molecules of the subset.

For substructure-based clustering, well-established methods exist for generating Maximum
Common Substructures (MCS) that are present in all members of the structure population or
in a defined proportion thereof [2]. For similarity-based clusters, however, such an MCS
may not exist for the required proportion of the dataset, or the common substructure may
be so small that it is no longer representative and therefore carries little information.

The aim of this approach is to provide descriptive scaffold structures for clusters whose
intrinsic structural diversity is too high to be characterized by conventional Maximum
Common Substructure approaches.

Results

The approach presented here allows an MCS to be generated for similarity-based clusters
with a given inherent structural diversity. It does so by using query features such as
atom lists and query bonds to add "fuzziness" in atom and bond information to the MCS.

As a result, the generated "fuzzy" MCS, while still fully database-searchable, is more
meaningful for characterizing clusters because it can cover larger parts of the full
structures than a conventional MCS. The approach was implemented in Pipeline Pilot™ as a
proof of concept but is general enough to be transferred to other technical platforms.

[Figure: Conventional MCS compared with "Fuzzy" MCS]
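To illustrate what atom lists and query bonds look like in a searchable query, the following minimal sketch renders them as SMARTS primitives. SMARTS is an assumption made here for illustration only; the poster's component works with Pipeline Pilot™ query features. The helper names `atom_list_smarts` and `query_bond_smarts` are hypothetical.

```python
# Hypothetical helpers: SMARTS is assumed as the target query language;
# the poster's own component uses Pipeline Pilot query features instead.

ATOMIC_NUMBERS = {"C": 6, "N": 7, "O": 8, "S": 16}

# Allowed bond types -> SMARTS bond primitive
BOND_SMARTS = {
    frozenset({"single"}): "-",
    frozenset({"double"}): "=",
    frozenset({"single", "double"}): "-,=",
    frozenset({"aromatic"}): ":",
}


def atom_list_smarts(elements):
    """Render the allowed elements as a SMARTS atom list, e.g. [#8,#16]."""
    numbers = sorted(ATOMIC_NUMBERS[e] for e in set(elements))
    return "[" + ",".join(f"#{n}" for n in numbers) + "]"


def query_bond_smarts(bond_types):
    """Render the allowed bond types; fall back to '~' (any bond)."""
    return BOND_SMARTS.get(frozenset(bond_types), "~")


# An [O,S] atom linked by a single-or-double bond to a carbon atom
fragment = (
    atom_list_smarts(["O", "S"])
    + query_bond_smarts(["single", "double"])
    + atom_list_smarts(["C"])
)
print(fragment)
```

Atomic-number primitives such as `#8` are used so that the pattern does not depend on whether the matched atom is aromatic.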

Methods

Content of the Pipeline Pilot™ component "Generate Fuzzy MCS" (slightly simplified):

1. Reduced graphs
2. Generalized MCS
3. Generalized MCS mapped onto structures
4. Structures reduced & mappings normalized
5. Collected atom/bond information transferred into queries
6. Query bonds coded as stereo bonds
7. Final "fuzzy" MCS

[Figure: workflow panels for steps 1 to 7; atoms labelled "any", query bonds labelled "s/d", with example atom and bond numbers]

Step 5 example: collected information and the resulting query features

Atom No   Elements (collected)   Element (query)
1         O,O,S                  [O,S]
7         N,N,O                  [N,O]
11        C,C,C                  C

Bond No   Types (collected)   Type (query)
3         1,1,1               single
9         1,2,1               single/double
14        2,1,4               aromatic
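The collection in step 5 can be sketched in a few lines of Python. This is a minimal illustration that reproduces the example values from the poster, not the component's actual code. The bond type codes are read as 1 = single, 2 = double, 4 = aromatic (the MDL molfile convention). The rule that any aromatic observation yields an aromatic query bond is an assumption inferred from the bond 14 example, and the "any" fallback and both function names are hypothetical.

```python
def atom_query(elements):
    """Collapse the elements seen at one mapped atom into a query atom."""
    unique = sorted(set(elements))
    if len(unique) == 1:
        return unique[0]                   # plain atom, no list needed
    return "[" + ",".join(unique) + "]"    # atom list, e.g. [O,S]


def bond_query(types):
    """Collapse the bond type codes seen at one mapped bond into a query bond."""
    unique = set(types)
    if 4 in unique:
        # Assumption: one aromatic observation makes the query bond aromatic,
        # as suggested by the bond 14 example (2,1,4 -> aromatic).
        return "aromatic"
    if unique == {1}:
        return "single"
    if unique == {2}:
        return "double"
    if unique == {1, 2}:
        return "single/double"
    return "any"                           # hypothetical fallback


# Values taken from the step 5 example
collected_atoms = {1: ["O", "O", "S"], 7: ["N", "N", "O"], 11: ["C", "C", "C"]}
collected_bonds = {3: [1, 1, 1], 9: [1, 2, 1], 14: [2, 1, 4]}

atom_queries = {no: atom_query(e) for no, e in collected_atoms.items()}
bond_queries = {no: bond_query(t) for no, t in collected_bonds.items()}
print(atom_queries)
print(bond_queries)
```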

Where to get / How to contribute
The current version of the component is published on the Accelrys® Pipeline
Pilot™ user forum and can be downloaded from
https://community.accelrys.com/message/16528.
Contributions are welcome, both to improve the approach and to transfer
the concept to other technology platforms such as RDKit, CDK or KNIME.

Current limitations
• Aromaticity may cause unexpected effects (e.g. single/double vs. aromatic bonds)
• Symmetric substructures may blur atom/bond information (e.g. carboxyl or nitro groups)
• Currently, a single occurrence of an atom/bond feature is enough for it to be included
in a query feature. For larger clusters, a percentage threshold may be required
• Different ring sizes and chain lengths cannot yet be described by query patterns.
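The percentage threshold mentioned in the limitations could look like the following sketch. It is not part of the published component; the function name `thresholded_features` and the parameter `min_fraction` are hypothetical. With `min_fraction=0.0` it reproduces the current behaviour, where one occurrence is enough for inclusion.

```python
from collections import Counter


def thresholded_features(observations, min_fraction=0.0):
    """Keep only features seen in at least min_fraction of the observations.

    observations: one entry per cluster member for a single mapped atom
    or bond, e.g. element symbols or bond type names.
    """
    total = len(observations)
    if total == 0:
        return []
    counts = Counter(observations)
    return sorted(f for f, n in counts.items() if n / total >= min_fraction)


# Ten cluster members: nine carry O at this position, one carries S
observed = ["O"] * 9 + ["S"]

print(thresholded_features(observed))                    # current behaviour
print(thresholded_features(observed, min_fraction=0.2))  # rare S is dropped
```

A threshold of this kind trades coverage for specificity: members that carry a dropped feature no longer match the resulting query.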

[1] Downs GM, Barnard JM: Clustering Methods and Their Uses in Computational Chemistry. In Reviews in Computational Chemistry (18). Chichester: Wiley and Sons; 2003, 1-40.
[2] Ehrlich HC, Rarey M: Maximum common subgraph isomorphism algorithms and their applications in molecular science: a review. WIREs Comput Mol Sci 2011, 1:68-79.
