Clustering made human



Miklos Vargyas



                        •Solutions for Cheminformatics
Cluster in computing
Computer cluster




                                      3
Cluster in Chemistry
Transition metal carbonyl clusters




Dimanganese-decacarbonyl                    di-tungsten tetra(hpp)




Transition metal halide clusters
Boron hydrides
Gas-phase clusters and fullerenes
                                                        4
Cluster in Chemistry/Physics

Nanoscale particles
• Fullerenes
• Nano machines




                              Images produced by MarvinSpace

                                                     5
Star cluster

gravitationally bound groups of stars




                       Image from Wikipedia, the free encyclopedia
                                                                     6
Clustering cars

Live demonstration


Group by property
• Shape, size, type, brand, colour
• Many possible arrangements, multiple aspects
Group by similarity
• Categorical perception



                                            7
Why is clustering stars easy?

God did the job for us!
• Stars have an apparent spatial arrangement
• Distance between stars defines clusters




                                               8
Why is clustering cars hard?

Lack of innate spatial arrangement
 • Artificial arrangement
 • Various approaches, no superior one
 • “Cars come in all shapes and sizes”
Problem of dimensionality
• Why 2?!




                                           9
So what about molecules?

Are they like stars or rather like cars?
 • They come in all shapes and sizes
 • Vast number of properties
Chemical spaces
 • Select molecular properties
 • Estimate or measure them
 • Use them as coordinates
 • Place your molecules as points in this abstract space
 • Group those that are close to each other to form clusters
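
The recipe on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the talk: the molecule names and their (logP, TPSA) values are made up, the cut-off of 5.0 is arbitrary, and `cluster_by_distance` is a hypothetical helper. Each molecule becomes a point, points closer than the cut-off are linked, and linked points form a cluster.

```python
import math

# Hypothetical molecules with made-up (logP, TPSA) values used as coordinates.
molecules = {
    "mol_a": (1.0, 20.0),
    "mol_b": (1.2, 22.0),
    "mol_c": (1.1, 21.0),
    "mol_d": (4.0, 90.0),
    "mol_e": (4.2, 92.0),
    "mol_f": (8.0, 150.0),
}

def cluster_by_distance(points, cutoff):
    """Link points whose Euclidean distance is below cutoff; return the linked groups."""
    names = list(points)
    parent = {n: n for n in names}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if math.dist(points[p], points[q]) < cutoff:
                parent[find(p)] = find(q)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

clusters = cluster_by_distance(molecules, cutoff=5.0)
for members in clusters:
    print(members)
```

In practice the properties would need to be scaled to comparable ranges first, otherwise the coordinate with the largest numeric range dominates the distance.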



                                                       10
Example in 2D




            11
Further attempts in 2D
[Figure: two scatter plots of the molecules in 2D; axes labelled logP vs. tpsa and mass vs. tpsa]

                                                                                           12
Molecule clusters by similarity

Jarvis-Patrick clustering
 • Fast
 • Tanimoto similarity
 • Globular clusters
 • Tendency to create large number of singletons
 • Molecular properties & fingerprint

 jarp -i SC1000.cfp -m 0 -f 1024 -t 0.6 -c 0.1 -o SC1000.jarp.t0.6.c0.1 -g -y -z

 Number of objects = 999
 Number of clusters (without singletons) = 2
 Number of singletons = 8
 Average dissimilarity = 0.66208726
 Minimum dissimilarity = 0.0
 Maximum dissimilarity = 0.9411765
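
For readers unfamiliar with the method, here is a minimal sketch of the textbook Jarvis-Patrick scheme combined with Tanimoto similarity. It is an assumption-laden illustration, not the jarp implementation: it uses the classic parameters (`k` nearest neighbours, at least `k_min` shared neighbours) rather than the `-t` and `-c` options on this slide, and the fingerprints are made-up sets of on-bit indices.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def jarvis_patrick(fps, k, k_min):
    """Merge i and j when each is in the other's k-nearest-neighbour list
    and the two lists share at least k_min members."""
    n = len(fps)
    neighbours = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: tanimoto(fps[i], fps[j]), reverse=True)
        neighbours.append(set(ranked[:k]))

    parent = list(range(n))  # union-find forest over molecule indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if (j in neighbours[i] and i in neighbours[j]
                    and len(neighbours[i] & neighbours[j]) >= k_min):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up fingerprints: two families of three similar molecules and one outlier.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
    {10, 11, 12, 13}, {10, 11, 12, 14}, {10, 11, 12, 15},
    {20, 21},
]
result = jarvis_patrick(fps, k=2, k_min=1)
n_clusters = sum(1 for g in result if len(g) > 1)
n_singletons = sum(1 for g in result if len(g) == 1)
print(result, n_clusters, n_singletons)
```

The outlier ends up as a singleton because no other molecule lists it among its nearest neighbours, which is the behaviour behind the "large number of singletons" bullet.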

                                                  13
Parameter tuning


 t    c     Clusters   singletons

0.6   0.1         2           8

0.3   0.1       179         248

0.5   0.1         7          36




                                    14
The most populated cluster




                         15
Parameter tuning

 t    c     Clusters   singletons

0.6   0.1         2           8

0.3   0.1       179         248

0.5   0.1         7          36

0.5   0.5        10          37

0.5   0.8        81         115


                                    16
Another cluster




              17
So what’s wrong with that?
 • Manual tuning
 • Lack of interpretability

Need:
 • Automated (unsupervised) techniques
 • Easy-to-grasp, simple-to-understand “explanations”

One possible solution: MCS-based clustering



                                                       18
Maximum Common Substructure

Largest substructure shared by two molecules
MCS




Simple concept! More human, visual.
Yet hard (= expensive (= slow)) to compute.
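
To make the concept concrete, here is a brute-force sketch that finds the largest connected common substructure of two toy molecular graphs. Everything in it is an assumption for illustration: atoms are plain element symbols, bonds are index pairs, hydrogens and bond orders are ignored, and `common_substructure` is a hypothetical name. The exhaustive search also shows why the problem is expensive: the number of partial mappings it tries grows very quickly with molecule size.

```python
def common_substructure(atoms_a, bonds_a, atoms_b, bonds_b):
    """Return the largest connected atom mapping {index in A: index in B}
    that preserves element labels and bonded/non-bonded relations."""
    adj_a = {frozenset(b) for b in bonds_a}
    adj_b = {frozenset(b) for b in bonds_b}
    best = {}

    def bonded(adj, i, j):
        return frozenset((i, j)) in adj

    def extend(mapping):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        for a in range(len(atoms_a)):
            if a in mapping:
                continue
            # Stay connected: a new atom must be bonded to an already mapped atom.
            if mapping and not any(bonded(adj_a, a, m) for m in mapping):
                continue
            for b in range(len(atoms_b)):
                if b in mapping.values() or atoms_a[a] != atoms_b[b]:
                    continue
                # Bonds must agree with every pair mapped so far.
                if all(bonded(adj_a, a, m) == bonded(adj_b, b, mapping[m])
                       for m in mapping):
                    mapping[a] = b
                    extend(mapping)
                    del mapping[a]

    extend({})
    return best

# Toy heavy-atom graphs (hydrogens and bond orders ignored).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])

mcs = common_substructure(*ethanol, *propanol)
print(len(mcs))  # 3 atoms: the C-C-O fragment
```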

                                               19
MCS of a structure set




                     20
Hierarchical star clusters

star




                                21
Hierarchical star clusters

star cluster
 • star




                                        22
Hierarchical star clusters

galaxy
 • star cluster
   – star




                                           23
Hierarchical star clusters

local group
 • galaxy
   – star cluster
     • star




                                             24
Hierarchical star clusters

supercluster
 • cluster
   – local group
     • galaxy
       » star cluster




                                                   25
Visualisation of hierarchy

Dendrogram
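
A dendrogram records the order in which clusters are merged. The sketch below builds one with single-linkage agglomerative clustering on a handful of made-up 1-D positions; the function names `agglomerate` and `render` are hypothetical, and single linkage is just one of several possible merge criteria.

```python
def agglomerate(points):
    """Single-linkage agglomerative clustering of 1-D points.
    Returns nested tuples that mirror the dendrogram."""
    # Each cluster is (tree, members).
    clusters = [(p, [p]) for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]

def render(tree, depth=0):
    """Indented text rendering: '+' marks a merge, leaves are the points."""
    if isinstance(tree, tuple):
        lines = ["  " * depth + "+"]
        for child in tree:
            lines += render(child, depth + 1)
        return lines
    return ["  " * depth + str(tree)]

tree = agglomerate([1.0, 1.5, 5.0, 5.2, 20.0])
print("\n".join(render(tree)))
```

Cutting the tree at a given depth yields a flat clustering, so one hierarchy serves several levels of granularity.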




                                      26
Hierarchical MCS




               27
Intuitive visualisation




                      28
SAR table view




             29
R-group deconvolution




                    30
Speed-up achieved last year

[Chart: running time (sec) vs. structure count (0 to 35000); series: 2006, 2007, Linear (2007)]


                                                                           Presented at UGM’07
                                                                                                 31
Speed-up achieved this year

[Chart: running time (sec) vs. structure count (0 to 35000); series: 2006, 2007, 2008]




                                                                                          32
Speed-up this year

[Chart: running time (sec, logarithmic scale from 0.1 to 10000) vs. structure count (0 to 35000); series: 2006, 2007, 2008]




                                                                                           33
Clustering performance comparison

[Chart: running time (min) vs. structure count (0 to 120000); series: LibraryMCS, Jarvis-Patrick, Ward-Murtagh]



                                                                                      34
Find out more

Product descriptions & links
 www.chemaxon.com/products.html

Forum
 www.chemaxon.com/forum

Presentations and posters
 www.chemaxon.com/conf

Download
 www.chemaxon.com/download.html




                                                35

Clustering Made Human: US UGM 2008