SmallWorld: Efficient Maximum
Common Subgraph Searching of
  Large Chemical Databases

Roger Sayle, Jose Batista and Andrew Grant
   NextMove Software, Cambridge, UK
   AstraZeneca R&D, Alderley Park, UK

ACS National Meeting, Philadelphia, USA 22nd August 2012
2d chemical similarity
[Bar chart: benchmark result (%) for each 2D similarity method]

     Method            Result
     ECFP_4             79.4%
     FCFP_8             76.0%
     LINGOs             75.6%
     MACCS              66.2%
     Marvin CF          65.2%
     JChem CF           63.7%
     Daylight           64.0%
     IBM                42.0%

Briem & Lessel bioactivity benchmark [2000]
the myth of fingerprints

• The most popular chemical similarity approaches
  reduce the graph representation of a molecule to a
  vector of local features, and compare these.
• “All the right notes, just not in the right order”.
• Binary fingerprints (MACCS and Daylight) suffer from
  feature saturation where all long alkanes, proteins
  and nucleic acids have the same fingerprint.
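The saturation effect can be illustrated with a toy path-based binary fingerprint. This is a minimal sketch, not the Daylight or MACCS implementation: the function names and the 8-atom maximum path length are assumptions made for illustration, and only unbranched alkanes are modelled.

```python
def path_features(n_carbons, max_atoms=8):
    """Toy binary fingerprint of an unbranched alkane.

    One feature per carbon-chain path of 1..max_atoms atoms. The cap of
    8 atoms is an assumed parameter, chosen only for illustration.
    """
    return frozenset("C" * k for k in range(1, min(n_carbons, max_atoms) + 1))


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets."""
    return len(a & b) / len(a | b)


butane = path_features(4)
octane = path_features(8)
decane = path_features(10)
eicosane = path_features(20)

print(tanimoto(butane, octane))    # 0.5: short chains are still distinguished
print(tanimoto(decane, eicosane))  # 1.0: every chain past the cap looks identical
```

Once a chain is longer than the longest enumerated path, adding atoms sets no new bits, so the fingerprint stops changing.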




counter intuition



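[Figure: chemical structures not recoverable from the extracted text; only the reported scores remain.]

Daylight Tanimoto: 0.44
Daylight Tanimoto: 0.39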


String Edit distance

• In 1965, Vladimir Levenshtein introduced string edit
  distance as a similarity measure between strings.
• It is the minimum number of insertions, deletions and
  substitutions needed to transform one string into another.
• Thanks to the efficient dynamic programming algorithms
  of Needleman-Wunsch and Smith-Waterman,
  sequence alignment is at the core of bioinformatics.
• Prior to string edit distance, similarity between
  sequences was assessed by heuristic methods
  locally comparing short words, or k-tuples.
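The dynamic programming idea behind string edit distance can be sketched in a few lines. This is the textbook unit-cost Levenshtein recurrence, kept to two rows of the table; the function name is chosen here for illustration.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # prev[j] holds the distance between the first i-1 characters of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The run time is proportional to the product of the two string lengths, which is what makes exact alignment practical for sequences.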
graph Edit distance

• Graph Edit Distance (GED) is the extension of this
  concept to graphs, as the minimum number of edit
  operations required to transform one graph into
  another.
   – Alberto Sanfeliu and K.S. Fu, “A Distance Measure between
     Attributed Relational Graphs for Pattern Recognition”, IEEE
     Transactions on Systems, Man, and Cybernetics (SMC), Vol.
     13, No. 3, pp. 353-362, 1983.
• Edit operations consist of insertions, deletions and
  substitutions of nodes and edges (atoms and bonds).

GED and MCES

• Unfortunately, calculating GED has been shown to be
  a generalization of calculating Maximum Common
  Subgraph (MCS), a favorite of chemists but known to
  be computationally very expensive (NP-Hard).
   – Horst Bunke, “On a Relation between Graph Edit Distance
     and Maximum Common Subgraph”, Pattern Recognition
     Letters, Vol. 18, No. 8, pp. 689-694, 1997.
   – Horst Bunke and Kim Shearer, “A Graph Distance Metric
     Based on the Maximal Common Subgraph”, Pattern
     Recognition Letters, Vol. 19, pp. 255-259, 1998.
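The link between the two quantities can be written down directly. The sketch below encodes the relations as they are commonly quoted from the two papers above, assuming graph size is counted in nodes and, for the first relation, the particular edit cost function under which the 1997 result holds; the function names are hypothetical and the conditions should be checked against the papers themselves.

```python
def ged_from_mcs(size_g1, size_g2, size_mcs):
    """Edit distance implied by the MCS size (relation attributed to Bunke, 1997).

    Assumes the specific cost function of that paper; sizes are node counts.
    Everything outside the common subgraph is deleted from one graph and
    inserted to build the other.
    """
    return size_g1 + size_g2 - 2 * size_mcs


def mcs_distance(size_g1, size_g2, size_mcs):
    """Normalised MCS-based distance in [0, 1] (Bunke and Shearer, 1998)."""
    return 1.0 - size_mcs / max(size_g1, size_g2)


# Two hypothetical graphs of 6 and 8 nodes sharing a 5-node common subgraph.
print(ged_from_mcs(6, 8, 5))  # 4
print(mcs_distance(6, 8, 5))  # 0.375
```

Either form shows why an efficient MCS search also gives an efficient edit-distance search.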


advances in computer science

• Fortunately, recent developments in theoretical
  computer science and advances in processing power
  and disk storage finesse the problem.
  – B.T. Messmer and H. Bunke, “Subgraph Isomorphism in
    Polynomial Time”, Tech. Report, Univ. Bern, 1995.
  – B.T. Messmer and H. Bunke, “Subgraph Isomorphism
    Detection in Polynomial Time on Preprocessed Graphs”, In
    Proc. Asian Conf. on Computer Vision, pp. 151-155, 1995.
  – Michel Neuhaus and Horst Bunke, “Bridging the Gap
    Between Graph Edit Distance and Kernel Machines”, World
    Scientific Press, 2007.
The secret... pre-processing

• Imagine in advance enumerating, canonicalizing and
  storing all subgraphs for a given molecule, sorted by
  the number of bonds.
• The task of finding the largest common subgraph
  then becomes an almost trivial ordered intersection.
• Modern state-of-the-art cheminformatics systems
  can easily handle databases several orders of
  magnitude larger than the current largest non-virtual
  databases (i.e. many billions of subgraphs).
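The "ordered intersection" can be sketched as a two-pointer merge over pre-sorted subgraph lists. This is an illustrative sketch only: it assumes each molecule's subgraphs have already been enumerated and canonicalized into (bond count, key) pairs, and the string keys in the example are hypothetical stand-ins for whatever canonical form a real index would store.

```python
def sort_key(subgraph):
    """Order subgraphs by descending bond count, then by canonical key."""
    bonds, key = subgraph
    return (-bonds, key)


def largest_common_subgraph(subs_a, subs_b):
    """Return the first (largest) entry present in both sorted lists, or None.

    Both lists must already be sorted with sort_key, so the first match
    found while merging has the maximum bond count.
    """
    i = j = 0
    while i < len(subs_a) and j < len(subs_b):
        ka, kb = sort_key(subs_a[i]), sort_key(subs_b[j])
        if ka == kb:
            return subs_a[i]
        if ka < kb:
            i += 1
        else:
            j += 1
    return None


# Hypothetical pre-computed subgraph lists for propane and ethanol.
propane = sorted([(2, "CCC"), (1, "CC"), (0, "C")], key=sort_key)
ethanol = sorted([(2, "CCO"), (1, "CC"), (1, "CO"), (0, "C"), (0, "O")], key=sort_key)

print(largest_common_subgraph(propane, ethanol))  # (1, 'CC')
```

All the expensive graph work happens once at indexing time; the query itself is a linear scan.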

Smallworld chemical space




a grand unified theory

• The three fundamental forces of cheminformatics:
    – Molecular Identity
    – Substructure Search
    – Chemical Similarity


• Bemis and Murcko Scaffolds
• Schuffenhauer Scaffold Trees
• Feature, Path and Radial Fingerprints
• Matched Molecular Pairs

graph vs. subgraph enumeration

• All traditional MCS algorithms start from one bond
  and grow; SmallWorld starts from N bonds and
  shrinks.
• In typical databases, highly similar molecules, which
  cause problems for MCS, are found quickly, typically
  terminating the search.
• The work of Jean-Louis Reymond et al. on GDB-15,
  using Brendan McKay’s GENG to enumerate all
  possible graphs, significantly overestimates the
  number of (sub)graphs encountered in practice.
Efficient subgraph enumeration

• A connected Maximum Common Edge Subgraph
  (MCES) with one fewer bond is formed by either (i)
  deleting a bond to a terminal atom, or (ii) deleting a
  ring (cyclic) bond.

• Partitioning cyclic from acyclic bonds can be done
  efficiently in O(N) time, and even this only needs to
  be recalculated after deleting a ring bond, as deleting
  terminal bonds doesn’t affect ring membership.
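The two deletion rules can be sketched with a standard linear-time bridge search: a bond is acyclic exactly when it is a bridge. This is a minimal illustration, not the SmallWorld implementation. It assumes a simple graph given as a list of (atom, atom) pairs, uses recursion (fine for molecule-sized graphs), and for brevity recomputes ring membership on every call rather than only after a ring bond is deleted.

```python
from collections import defaultdict


def find_bridges(edges):
    """Return the set of acyclic bonds (bridges) using one depth-first search."""
    adj = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    disc, low, bridges = {}, {}, set()

    def dfs(u, parent_edge):
        disc[u] = low[u] = len(disc)
        for v, idx in adj[u]:
            if idx == parent_edge:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])      # back edge closes a ring
            else:
                dfs(v, idx)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:               # nothing below v reaches above u
                    bridges.add(edges[idx])

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return bridges


def children(edges):
    """All connected edge subgraphs with exactly one bond fewer."""
    edges = [tuple(e) for e in edges]
    bridges = find_bridges(edges)
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    result = []
    for e in edges:
        is_ring_bond = e not in bridges
        is_terminal_bond = degree[e[0]] == 1 or degree[e[1]] == 1
        if is_ring_bond or is_terminal_bond:
            result.append([f for f in edges if f != e])
    return result


benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
toluene = benzene + [(0, 6)]
butane = [(0, 1), (1, 2), (2, 3)]

print(len(children(benzene)))  # 6 ring-bond deletions
print(len(children(toluene)))  # 6 ring bonds + 1 terminal bond = 7
print(len(children(butane)))   # only the 2 end bonds can be removed
```

The children are not canonicalized here, so isomorphic results (for example the six identical paths obtained from benzene) are not merged; a real index would store each canonical subgraph once.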

subgraph counts of molecules
     Name                        Atoms           MW          Anon SGs    Elem SGs
     Benzene                             6            78            7           7
     Cubane                              8           104           64          64
     Ferrocene                         11            186        3,154       3,219
     Aspirin                           13            180          127         332
     Dodecahedrane                     20            260      440,473     440,473
     Ranitidine                        21            314          436       1,207
     Clopidogrel                       21            322       10,071      22,170
     Morphine                          21            285      176,541     496,467
     Amlodipine                        28            409       58,139     147,128
     Lisinopril                        29            405       24,619      34,496
     Gefitinib                         31            447      190,901     337,174
     Atorvastatin                      41            559     3,638,523   6,019,427

molecule DB size distribution
   ≤ Bond Count   % MDDR 2011.2   % NCBI PubChem
    ≤ 20 bonds         6%              14%
    ≤ 25 bonds        18%              30%
    ≤ 30 bonds        36%              55%
    ≤ 35 bonds        56%              77%
    ≤ 40 bonds        73%              89%
    ≤ 45 bonds        83%              93%
    ≤ 50 bonds        89%              95%
    ≤ 55 bonds        92%              97%
    ≤ 60 bonds        93%              98%
    ≤ 65 bonds        94%              98%
    ≤ 70 bonds        95%              99%
smallworld search




SmallWorld lattice: Bold circles denote indexed molecules,
thin circles represent virtual subgraphs.
smallworld search




The solid circle denotes a query structure which may be
either an indexed molecule or a virtual subgraph.
smallworld search




The first iteration of the search adds the neighbors of the
query to the “search wavefront”.
smallworld search




Each subsequent iteration propagates the wavefront by
considering the unvisited neighbors of the wavefront.
smallworld search




At each iteration, “hits” are reported as the set of indexed
molecules that are members of the wavefront.
smallworld search




The search terminates once sufficient indexed neighbors
have been found (or a suitable iteration limit is reached).
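The wavefront search described on these slides is a breadth-first search over the lattice. The sketch below is one possible reading, not the SmallWorld code: the lattice is a hypothetical in-memory adjacency dictionary (parents and children both listed as neighbors), and the parameter names `max_hits` and `max_iterations` are assumptions.

```python
def wavefront_search(neighbors, indexed, query, max_hits=10, max_iterations=10):
    """Breadth-first search outwards from the query.

    Returns (molecule, distance) pairs for indexed molecules, nearest first.
    The query itself is not reported in this sketch.
    """
    visited = {query}
    wavefront = {query}
    hits = []
    for distance in range(1, max_iterations + 1):
        # Propagate: unvisited neighbors of the current wavefront.
        wavefront = {
            n for node in wavefront for n in neighbors.get(node, ()) if n not in visited
        }
        if not wavefront:
            break
        visited |= wavefront
        # Hits are the indexed molecules that are members of the wavefront.
        hits.extend((node, distance) for node in sorted(wavefront) if node in indexed)
        if len(hits) >= max_hits:
            break
    return hits


# Hypothetical toy lattice: Q is the query, M1 and M2 are indexed molecules,
# a, b and c are virtual subgraphs.
lattice = {
    "Q": ["a", "b"],
    "a": ["Q", "M1"],
    "b": ["Q", "c"],
    "c": ["b", "M2"],
    "M1": ["a"],
    "M2": ["c"],
}
indexed = {"M1", "M2"}

print(wavefront_search(lattice, indexed, "Q", max_hits=1))  # [('M1', 2)]
print(wavefront_search(lattice, indexed, "Q", max_hits=2))  # [('M1', 2), ('M2', 3)]
```

Because hits are collected iteration by iteration, they come out already sorted by edit distance from the query.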
algorithm references

• Scott Beamer et al. “Searching for a Parent Instead of
  Fighting Over Children: A Fast Breadth-First Search
  Implementation for Graph500”, UCB Tech. Rep. 2011.
• Ricardo A. Baeza-Yates, “A Fast Set Intersection
  Algorithm for Sorted Sequences”, In Proc. 15th
  Symposium on Combinatorial Pattern Matching,
  LNCS Vol. 3109, pp. 400-408, 2004.




proof-of-concept db statistics

• To evaluate design ideas a small prototype proof-of-
  concept “anonymous” database was implemented.
• 202,169,109 nodes (~2^28 nodes)
• 1,763,328,320 edges (~2^31 edges)
• Storage requirements: ~75 GB.
• Average degree (fan-out) of node: 17.44
• 62.5M acyclic nodes; 78.3M have a single ring.
• 1,030,565,730 edges were terminal edges and
  732,762,590 edges were ring deletion edges.
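These figures are mutually consistent, which a few lines of arithmetic confirm. The only assumption in this check is that the average degree counts each edge once at both of its end nodes.

```python
import math

nodes = 202_169_109
terminal_edges = 1_030_565_730
ring_deletion_edges = 732_762_590

# The two edge classes add up to the reported edge total.
edges = terminal_edges + ring_deletion_edges
print(edges)  # 1763328320

# Each edge contributes to the degree of both of its end nodes.
average_degree = 2 * edges / nodes
print(round(average_degree, 2))  # 17.44

# Bits needed to address every node and every edge.
print(math.ceil(math.log2(nodes)), math.ceil(math.log2(edges)))  # 28 31
```

So node identifiers fit in 28 bits and edge identifiers in 31 bits, matching the powers of two quoted above.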
example search statistics

• A Briem & Lessel benchmark run, searching for the
  (up to) 10 nearest neighbors of each of 380
  drug-sized query molecules, takes 8m42s (less than
  1.4 seconds per query) on a single i7-2600/16GB thread.
• The median distance for each search was 8 edits,
  with the shortest finishing after 2 edits and the
  furthest reaching 24 edits.
• Limiting to 10 edits reduced search time to 7 mins.
• Worst case “wavefront” size was 6,624,624 nodes
  which occurred at distance/iteration nine.
connected vs. disconnected mcs

• A frequent question with maximum common
  subgraph methods is whether to perform connected
  or disconnected MCS.
• One solution to this dilemma with SmallWorld is to
  allow an additional “edit operation” that can
  eliminate a vertex of degree two.
• This captures the similarity between two molecules
  that differ in the length of a linker or by the size of a
  ring [classic challenges with MCS similarity].
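The extra edit operation can be sketched as removing a degree-two vertex and joining its two neighbors directly. This is an illustrative guess at the operation, not the SmallWorld definition: atoms are plain integers, bond orders and element types are ignored, and the case where the two neighbors are already bonded is simply skipped.

```python
def eliminate_degree_two_vertices(edges):
    """Return every graph obtained by removing one degree-2 vertex
    and joining its two neighbors with a new edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    results = []
    for w, nbrs in adj.items():
        if len(nbrs) != 2:
            continue
        a, b = sorted(nbrs)
        if b in adj[a]:
            continue  # neighbors already bonded (three-membered ring); skipped here
        kept = {frozenset(e) for e in edges if w not in e}
        kept.add(frozenset((a, b)))
        results.append(sorted(tuple(sorted(e)) for e in kept))
    return results


cyclohexane = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
linker = [(0, 1), (1, 2), (2, 3)]  # a four-atom chain

ring_results = eliminate_degree_two_vertices(cyclohexane)
print(len(ring_results), len(ring_results[0]))  # 6 5: each result is a five-membered ring

chain_results = eliminate_degree_two_vertices(linker)
print(len(chain_results), len(chain_results[0]))  # 2 2: each result is a three-atom chain
```

A single such edit turns a six-membered ring into a five-membered one, or shortens a linker by one atom, without ever disconnecting the graph.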

representing activity cliffs




A useful feature of graph edit distance is intuitive
interpretation of GED(A,B) < GED(A,C).
conclusions

• The sub-linear behaviour of SmallWorld’s nearest
  neighbor calculation makes it faster than fingerprint-
  based similarity methods for sufficiently large data
  sets.
• This is thanks to the “blessing of dimensionality”.
• With continual advances in computer hardware,
  SmallWorld is likely to become the basis of most
  chemical similarity calculations within a decade.



acknowledgements

• AstraZeneca R&D for their support.


• Thank you for your time.
• Any questions?




TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

SmallWorld: Efficient Maximum Common Subgraph Searching of Large Chemical Databases

  • 1. SmallWorld: Efficient Maximum Common Subgraph Searching of Large Chemical Databases. Roger Sayle, Jose Batista and Andrew Grant. NextMove Software, Cambridge, UK; AstraZeneca R&D, Alderley Park, UK. ACS National Meeting, Philadelphia, USA, 22nd August 2012.
  • 2. 2D chemical similarity. Bar chart from the Briem & Lessel bioactivity benchmark [2000], y-axis 0% to 90%: ECFP_4 79.4%, FCFP_8 76.0%, LINGOs 75.6%, MACCS 66.2%, Marvin CF 65.2%, JChem CF 63.7%, Daylight 64.0%, IBM 42.0%.
  • 3. The myth of fingerprints
      • The most popular chemical similarity approaches reduce the graph representation of a molecule to a vector of local features, and compare these.
      • “All the right notes, just not in the right order”.
      • Binary fingerprints (MACCS and Daylight) suffer from feature saturation, where all long alkanes, proteins and nucleic acids have the same fingerprint.
  • 4. Counter intuition. [Structure drawings not reproduced.] Daylight Tanimoto: 0.44. Daylight Tanimoto: 0.39.
  • 5. String edit distance
      • In 1965, Vladimir Levenshtein introduced string edit distance as a similarity measure between strings.
      • It is the minimum number of insertions, deletions and substitutions needed to transform one string into another.
      • Thanks to the efficient dynamic programming algorithms of Needleman-Wunsch and Smith-Waterman, sequence alignment is at the core of bioinformatics.
      • Prior to string edit distance, similarity between sequences was assessed by heuristic methods that locally compared short words, or k-tuples.
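The definition on this slide can be made concrete with the standard dynamic programming recurrence. The following is a minimal sketch, not code from the talk; the function name `levenshtein` and the unit cost per edit are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at any time, so memory grows with the shorter dimension rather than with the product of the two lengths.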
  • 6. Graph edit distance
      • Graph Edit Distance (GED) is the extension of this concept to graphs: the minimum number of edit operations required to transform one graph into another.
        – Alberto Sanfeliu and K.S. Fu, “A Distance Measure between Attributed Relational Graphs for Pattern Recognition”, IEEE Transactions on Systems, Man and Cybernetics (SMC), Vol. 13, No. 3, pp. 353-362, 1983.
      • Edit operations consist of insertions, deletions and substitutions of nodes and edges (atoms and bonds).
  • 7. GED and MCES
      • Unfortunately, calculating GED has been shown to be a generalization of calculating Maximum Common Subgraph (MCS), a favorite of chemists but known to be computationally very expensive (NP-Hard).
        – Horst Bunke, “On a Relation between Graph Edit Distance and Maximum Common Subgraph”, Pattern Recognition Letters, Vol. 18, No. 8, pp. 689-694, 1997.
        – Horst Bunke and Kim Shearer, “A Graph Distance Metric Based on the Maximal Common Subgraph”, Pattern Recognition Letters, Vol. 19, pp. 255-259, 1998.
  • 8. Advances in computer science
      • Fortunately, recent developments in theoretical computer science and advances in processing power and disk storage finesse the problem.
        – B.T. Messmer and H. Bunke, “Subgraph Isomorphism in Polynomial Time”, Tech. Report, Univ. Bern, 1995.
        – B.T. Messmer and H. Bunke, “Subgraph Isomorphism Detection in Polynomial Time on Preprocessed Graphs”, In Proc. Asian Conf. on Computer Vision, pp. 151-155, 1995.
        – Michel Neuhaus and Horst Bunke, “Bridging the Gap Between Graph Edit Distance and Kernel Machines”, World Scientific Press, 2007.
  • 9. The secret... pre-processing
      • Imagine in advance enumerating, canonicalizing and storing all subgraphs for a given molecule, sorted by the number of bonds.
      • The task of finding the largest common subgraph then becomes an almost trivial ordered intersection.
      • Modern state-of-the-art cheminformatics systems can easily handle databases several orders of magnitude larger than the current largest non-virtual databases (i.e. many billions of subgraphs).
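The ordered intersection described on this slide can be sketched as a two-pointer merge over two pre-sorted subgraph lists. This is an illustrative sketch under stated assumptions, not SmallWorld's storage format: each subgraph is assumed to be a `(bond_count, canonical_key)` pair, the keys below are hand-written stand-ins for real canonical forms, and `largest_common_subgraph` is a hypothetical name.

```python
def largest_common_subgraph(subgraphs_a, subgraphs_b):
    """Return the first shared (bond_count, key) entry, or None if none is shared.

    Both lists must be sorted by descending bond count, then ascending key,
    so the first match found is also the largest common subgraph.
    """
    def order(entry):
        return (-entry[0], entry[1])

    i = j = 0
    while i < len(subgraphs_a) and j < len(subgraphs_b):
        ka, kb = order(subgraphs_a[i]), order(subgraphs_b[j])
        if ka == kb:
            return subgraphs_a[i]
        if ka < kb:
            i += 1  # entry in A cannot occur later in B; skip it
        else:
            j += 1  # entry in B cannot occur later in A; skip it
    return None


# Hand-written subgraph lists for two four-atom chains (illustrative keys only).
chain_o = [(3, "CCCO"), (2, "CCC"), (2, "CCO"), (1, "CC"), (1, "CO")]
chain_n = [(3, "CCCN"), (2, "CCC"), (2, "CCN"), (1, "CC"), (1, "CN")]
print(largest_common_subgraph(chain_o, chain_n))
```

Because both lists are sorted the same way, each entry is visited at most once, so the merge is linear in the combined list length.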
  • 10. SmallWorld chemical space [figure only]
  • 11. A grand unified theory
      • The three fundamental forces of cheminformatics:
        – Molecular Identity
        – Substructure Search
        – Chemical Similarity
      • Bemis and Murcko Scaffolds
      • Schuffenhauer Scaffold Trees
      • Feature, Path and Radial Fingerprints
      • Matched Molecular Pairs
  • 12. Graph vs. subgraph enumeration
      • All traditional MCS algorithms start from one bond and grow; SmallWorld starts from N bonds and shrinks.
      • In typical databases, highly similar molecules, which cause problems for MCS, are found quickly, typically terminating the search.
      • The work of Jean-Louis Reymond et al. on GDB-15, using Brendan McKay’s GENG to enumerate all possible graphs, significantly overestimates the number of (sub)graphs encountered in practice.
  • 13. Efficient subgraph enumeration
      • A connected Maximum Common Edge Subgraph (MCES) with one fewer bond is formed by either (i) deleting a bond to a terminal atom, or (ii) deleting a ring (cyclic) bond.
      • Partitioning cyclic from acyclic bonds can be done efficiently in O(N) time, and even this only needs to be recalculated after deleting a ring bond, as deleting terminal bonds doesn’t affect ring membership.
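The shrinking enumeration on this slide can be sketched for very small graphs as follows. This is a toy illustration, not the SmallWorld implementation: atoms are unlabeled (an "anonymous" graph), canonicalization is brute force over all vertex permutations (usable only up to about eight atoms), and ring membership is tested with a plain graph search rather than the O(N) partitioning the slide mentions. All function names are hypothetical.

```python
from itertools import permutations


def canonical(edges):
    """Brute-force canonical form of a small unlabeled graph given as edge tuples."""
    verts = sorted({v for e in edges for v in e})
    best = None
    for perm in permutations(range(len(verts))):
        relabel = dict(zip(verts, perm))
        key = tuple(sorted(tuple(sorted((relabel[a], relabel[b]))) for a, b in edges))
        if best is None or key < best:
            best = key
    return best


def connected_without(edges, removed):
    """True if the endpoints of `removed` stay connected after deleting it (ring bond)."""
    a, b = removed
    rest = [e for e in edges if e != removed]
    seen, stack = {a}, [a]
    while stack:
        v = stack.pop()
        for x, y in rest:
            for u, w in ((x, y), (y, x)):
                if u == v and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return b in seen


def deletable(edges):
    """Yield bonds whose removal leaves one connected piece: terminal or ring bonds."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    for e in edges:
        terminal = degree[e[0]] == 1 or degree[e[1]] == 1
        if terminal or connected_without(edges, e):
            yield e


def enumerate_subgraphs(edges):
    """All distinct connected subgraphs reachable by repeated single-bond deletion."""
    start = canonical(edges)
    seen, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for e in list(deletable(current)):
            child = canonical([x for x in current if x != e])
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return seen


six_ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(len(enumerate_subgraphs(six_ring)))
```

For a six-membered ring the sketch finds the ring itself, the chains of five down to one bond, and the zero-bond graph, which agrees with the Benzene row of the subgraph count table if the zero-bond graph is included in that count.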
  • 14. Subgraph counts of molecules

      Name            Atoms   MW    Anon SGs    Elem SGs
      Benzene             6    78           7           7
      Cubane              8   104          64          64
      Ferrocene          11   186       3,154       3,219
      Aspirin            13   180         127         332
      Dodecahedrane      20   260     440,473     440,473
      Ranitidine         21   314         436       1,207
      Clopidogrel        21   322      10,071      22,170
      Morphine           21   285     176,541     496,467
      Amlodipine         28   409      58,139     147,128
      Lisinopril         29   405      24,619      34,496
      Gefitinib          31   447     190,901     337,174
      Atorvastatin       41   559   3,638,523   6,019,427
  • 15. Molecule DB size distribution

      Bond count    % MDDR 2011.2   % NCBI PubChem
      ≤ 20 bonds          6%              14%
      ≤ 25 bonds         18%              30%
      ≤ 30 bonds         36%              55%
      ≤ 35 bonds         56%              77%
      ≤ 40 bonds         73%              89%
      ≤ 45 bonds         83%              93%
      ≤ 50 bonds         89%              95%
      ≤ 55 bonds         92%              97%
      ≤ 60 bonds         93%              98%
      ≤ 65 bonds         94%              98%
      ≤ 70 bonds         95%              99%
  • 16. SmallWorld search. SmallWorld lattice: bold circles denote indexed molecules, thin circles represent virtual subgraphs.
  • 17. SmallWorld search. The solid circle denotes a query structure, which may be either an indexed molecule or a virtual subgraph.
  • 18. SmallWorld search. The first iteration of the search adds the neighbors of the query to the “search wavefront”.
  • 19. SmallWorld search. Each subsequent iteration propagates the wavefront by considering the unvisited neighbors of the wavefront.
  • 20. SmallWorld search. At each iteration, “hits” are reported as the set of indexed molecules that are members of the wavefront.
  • 21. SmallWorld search. The search terminates once sufficient indexed neighbors have been found (or a suitable iteration limit is reached).
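The search procedure walked through on slides 16 to 21 amounts to a breadth-first search with an early exit. The sketch below illustrates that idea on a hand-made toy lattice. The adjacency dictionary, the node names, and the parameters `wanted` and `max_iterations` are assumptions for illustration, not the talk's data structures; as a simplification, the query itself is never reported as a hit.

```python
def wavefront_search(neighbors, indexed, query, wanted, max_iterations):
    """Breadth-first search returning (indexed node, distance) pairs, nearest first."""
    visited = {query}
    wavefront = [query]
    hits = []
    for distance in range(1, max_iterations + 1):
        # Propagate: collect the unvisited neighbors of the current wavefront.
        next_front = []
        for node in wavefront:
            for nb in neighbors.get(node, ()):
                if nb not in visited:
                    visited.add(nb)
                    next_front.append(nb)
        if not next_front:
            break  # the reachable part of the lattice is exhausted
        # Report the indexed molecules that are members of the new wavefront.
        hits.extend((nb, distance) for nb in next_front if nb in indexed)
        if len(hits) >= wanted:
            break  # sufficient indexed neighbors found
        wavefront = next_front
    return hits


# Toy lattice: "M1" and "M2" are indexed molecules, "s1".."s3" virtual subgraphs.
toy_lattice = {
    "Q": ["s1", "s2"],
    "s1": ["Q", "M1", "s3"],
    "s2": ["Q", "s3"],
    "s3": ["s1", "s2", "M2"],
    "M1": ["s1"],
    "M2": ["s3"],
}
toy_indexed = {"M1", "M2"}
print(wavefront_search(toy_lattice, toy_indexed, "Q", wanted=2, max_iterations=5))
```

Since every lattice edge is a single edit, the iteration at which a molecule is reached is its edit distance from the query, so hits come out sorted by distance.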
  • 22. SmallWorld search [figure only]
  • 23. SmallWorld search [figure only]
  • 24. Algorithm references
      • Scott Beamer et al., “Searching for a Parent Instead of Fighting Over Children: A Fast Breadth-First Search Implementation for Graph500”, UCB Tech. Rep., 2011.
      • Ricardo A. Baeza-Yates, “A Fast Set Intersection Algorithm for Sorted Sequences”, In Proc. 15th Symposium on Combinatorial Pattern Matching, LNCS Vol. 3109, pp. 400-408, 2004.
  • 25. Proof-of-concept DB statistics
      • To evaluate design ideas, a small prototype proof-of-concept “anonymous” database was implemented.
      • 202,169,109 nodes (~2^28 nodes)
      • 1,763,328,320 edges (~2^31 edges)
      • Storage requirements: ~75 GB.
      • Average degree (fan-out) of a node: 17.44
      • 62.5M nodes are acyclic; 78.3M have a single ring.
      • 1,030,565,730 edges were terminal edges and 732,762,590 edges were ring deletion edges.
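The figures on this slide can be cross-checked with a few lines of arithmetic. The numbers below are copied from the slide; reading the average degree as twice the edge count divided by the node count (each edge counted at both of its endpoints) is an assumption, though it matches the quoted value.

```python
import math

nodes = 202_169_109
edges = 1_763_328_320
terminal_edges = 1_030_565_730
ring_deletion_edges = 732_762_590

# The two edge classes should account for every edge in the lattice.
edges_add_up = terminal_edges + ring_deletion_edges == edges

# Each undirected edge contributes to the degree of two nodes.
average_degree = 2 * edges / nodes

print(f"edge classes sum to total: {edges_add_up}")
print(f"average degree: {average_degree:.2f}")
print(f"log2(nodes) = {math.log2(nodes):.2f}, log2(edges) = {math.log2(edges):.2f}")
```

The base-2 logarithms fall just under 28 and 31, which is why the slide rounds the sizes to roughly 2^28 nodes and 2^31 edges.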
  • 26. Example search statistics
      • A Briem & Lessel benchmark run, searching for the (up to) 10 nearest neighbors of each of 380 drug-sized query molecules, takes 8m42s (less than 1.4 seconds per query) on a single thread of an i7-2600 with 16 GB.
      • The median distance for each search was 8 edits, with the shortest finishing after 2 edits and the furthest reaching 24 edits.
      • Limiting searches to 10 edits reduced the search time to 7 minutes.
      • The worst-case “wavefront” size was 6,624,624 nodes, which occurred at distance/iteration nine.
  • 27. Connected vs. disconnected MCS
      • A frequent question with maximum common subgraph methods is whether to perform connected or disconnected MCS.
      • One solution to this dilemma with SmallWorld is to allow an additional “edit operation” that can eliminate a vertex of degree two.
      • This captures the similarity between two molecules that differ in the length of a linker or by the size of a ring [classic challenges with MCS similarity].
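The extra edit operation on this slide can be illustrated on an unlabeled graph: removing a degree-two vertex and joining its two neighbors turns a six-membered ring into a five-membered one in a single step. This is a sketch under stated assumptions, not the talk's implementation. Edges are assumed to be a set of two-element frozensets, `eliminate_degree_two` and `ring` are hypothetical names, and refusing the operation when the neighbors are already bonded is a simplification chosen here to keep the graph simple.

```python
def eliminate_degree_two(edges, vertex):
    """Remove a degree-two vertex and bond its two neighbors to each other.

    Returns the new edge set, or None when the operation does not apply
    (the vertex does not have degree two, or its neighbors are already bonded).
    """
    incident = [e for e in edges if vertex in e]
    if len(incident) != 2:
        return None
    a, b = (next(v for v in e if v != vertex) for e in incident)
    bridge = frozenset((a, b))
    if bridge in edges:
        return None  # would create a double edge, e.g. inside a three-membered ring
    return (edges - set(incident)) | {bridge}


def ring(n):
    """Edge set of an n-membered ring on vertices 0..n-1."""
    return {frozenset((i, (i + 1) % n)) for i in range(n)}


five_ring = eliminate_degree_two(ring(6), 0)
print(sorted(tuple(sorted(e)) for e in five_ring))
```

The same call applied to an interior atom of a chain shortens a linker by one atom, which is the other case the slide mentions.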
  • 28. Representing activity cliffs. A useful feature of graph edit distance is the intuitive interpretation of GED(A,B) < GED(A,C).
  • 29. Conclusions
      • The sub-linear behaviour of SmallWorld’s nearest neighbor calculation makes it faster than fingerprint-based similarity methods for sufficiently large data sets.
      • This is thanks to the “blessing of dimensionality”.
      • With continual advances in computer hardware, SmallWorld is likely to become the basis of most chemical similarity calculations within a decade.
  • 30. Acknowledgements
      • AstraZeneca R&D for their support.
      • Thank you for your time.
      • Any questions?