TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering

Gianmario Spacagna
gm.spacagna@gmail.com

March 2013



AgilOne, Inc.
1091 N Shoreline Blvd. #250
Mountain View, CA 94043
Agenda
1.   Introduction
2.   Problem description
3.   TunUp
4.   K-means
5.   Clustering evaluation
6.   Full space tuning
7.   Genetic algorithm tuning
8.   Conclusions
Big Data
Business Intelligence
Why? Where? What? How?
Insights into customers, products and companies

Can someone else know your customer better than you?
Do you have the domain knowledge and proper computation infrastructure?
Big Data as a Service (BDaaS)
Problem Description




[Figure: income, cost, customers]
Tuning of Clustering Algorithms
We need tuning when:
➢ A new algorithm or version is released
➢ We want to improve accuracy and/or performance
➢ A new customer arrives and the system must be adapted to the new dataset and requirements




TunUp
Java framework integrating JavaML and Watchmaker

Main features:

➢ Data manipulation (loading, labelling and normalization)
➢ Clustering algorithms (k-means)
➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)
➢ Evaluation techniques validation (Pearson Correlation t-test)
➢ Full search space tuning
➢ Genetic Algorithm tuning (local and parallel implementation)
➢ RESTful API for web service deployment (Tomcat on Amazon EC2)

Open-source: http://github.com/gm-spacagna/tunup
k-means
Geometric hard-assignment clustering algorithm:
   It partitions n data points into k clusters so that each point belongs to
   the cluster with the nearest centroid (the cluster mean).
     Given k clusters S = {S_1, ..., S_k}, where x_j is the j-th point of cluster S_i and μ_i is the
     centroid (mean) of S_i, the goal of k-means is to minimize the Within-Cluster Sum of Squares:

     $$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

      Algorithm:
1.    Initialization: a set of k random centroids is generated
2.    Assignment: each point is assigned to the closest centroid
3.    Update: the new centroids are calculated as the means of the new clusters
4.    Repeat from step 2 until convergence (the centroids are stable and no longer change)
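
The four steps above can be sketched in a few lines of plain Python. This is only an illustration, not TunUp's or JavaML's implementation: the function names are made up for the example, the initial centroids are drawn from the data points themselves (one common way to "generate random centroids"), and squared Euclidean distance is assumed.

```python
import random

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, max_iterations=20, seed=0):
    rng = random.Random(seed)
    # Step 1, initialization: pick k distinct data points as centroids (assumption).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2, assignment: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 4, convergence: stable assignments imply stable centroids.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3, update: each centroid becomes the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignment

def wcss(points, centroids, assignment):
    # Within-Cluster Sum of Squares, the quantity k-means tries to minimize.
    return sum(squared_euclidean(p, centroids[a]) for p, a in zip(points, assignment))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
total = wcss(points, centroids, assignment)
```

On this toy data the two well-separated groups end up in different clusters whatever the initial draw; on real data the result depends on the initial centroids.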
k-means tuning
     Input parameters required:
1.   K = (2,...,40)
2.   Distance measure
3.   Max iterations = 20 (fixed)

     Distance measures:
0.   Angular
2.   Chebyshev
3.   Cosine
4.   Euclidean
5.   Jaccard Index
6.   Manhattan
7.   Pearson Correlation Coefficient
8.   Radial Basis Function Kernel
9.   Spearman Footrule

Different input parameters, very different outcomes!
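
To illustrate why the distance measure matters, the sketch below implements four of the listed measures from their textbook definitions and shows that the "closest centroid" of the same point can change with the measure. These are not JavaML's implementations; the function names and the example point are assumptions made for the illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; undefined for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def nearest(point, centroids, distance):
    # Index of the centroid closest to the point under the given measure.
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

point = (1, 1)
centroids = [(4, 4), (1, 6)]
choice = {
    name: nearest(point, centroids, fn)
    for name, fn in [
        ("Euclidean", euclidean),
        ("Manhattan", manhattan),
        ("Chebyshev", chebyshev),
        ("Cosine", cosine_distance),
    ]
}
```

Here Manhattan distance assigns the point to the second centroid while the other three measures pick the first, so the same data and the same k can yield different clusters.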
Clustering Evaluation
Definition of cluster:
“A group of the same or similar elements gathered or occurring closely
together”

    How do we evaluate if a set of clusters is good or not?

          “Clustering is in the eye of the beholder” [Estivill-Castro, 2002]


    Two main categories:
➢ Internal criterion: based only on the clustered data itself
➢ External criterion: based on benchmarks of pre-classified items
Internal Evaluation
The common goal is to assign better scores when there is:
➢ High intra-cluster similarity
➢ Low inter-cluster similarity

The choice of the evaluation technique depends on the
nature of the data and the cluster model of the algorithm.


    Cluster models:
➢ Distance-based (k-means)
➢ Density-based (DBSCAN)
➢ Distribution-based (EM clustering)
➢ Connectivity-based (linkage clustering)
Proposed techniques
AIC: measure of the relative amount of information lost by a statistical
model. The clustering is modelled as a Gaussian mixture process.
(inverted fn.: lower is better)
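
For reference, the general textbook definition of AIC is given below. The symbols p (number of free model parameters) and \hat{L} (maximized likelihood of the model) are introduced here for the illustration; how TunUp derives the likelihood from a k-means result is not stated on the slide.

```latex
\mathrm{AIC} = 2p - 2\ln\hat{L}
```

A better fit raises \hat{L} and lowers the AIC, while extra parameters (for example a larger k) are penalized through the 2p term.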




Dunn: ratio between the minimum inter-cluster distance and the maximum
cluster diameter. (natural fn.: higher is better)

Davies-Bouldin: average similarity between each cluster and its most
similar one. (inverted fn.: lower is better)

Silhouette: measure of how well each point lies within its cluster. It indicates
whether the object is correctly clustered or would fit better in the
neighbouring cluster. (natural fn.: higher is better)
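
As one concrete instance of an internal criterion, the following sketch computes the mean Silhouette score from its standard definition. It assumes Euclidean distance and scores points in singleton clusters as 0; the function names are illustrative and this is not TunUp's code.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself contributes 0 to the sum).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette(points, [0, 1, 0, 1])   # mixes the two groups
```

The sensible grouping scores close to 1 and the mixed one scores below 0, which is the behaviour expected of a "natural" fitness function.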
External criterion:
AdjustedRand
Given a set of n elements S = {o_1, ..., o_n} and two partitions to compare,
X = {X_1, ..., X_r} and Y = {Y_1, ..., Y_s}:

$$\text{RandIndex} = \frac{\text{number of agreements between } X \text{ and } Y}{\text{total number of possible pair combinations}}$$

$$\text{AdjustedRandIndex} = \frac{\text{RandIndex} - \text{ExpectedIndex}}{\text{MaxIndex} - \text{ExpectedIndex}}$$



We can use AdjustedRand as the reference for the best clustering evaluation and
use it to validate the internal criteria.
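
A minimal sketch of the Adjusted Rand Index, using the usual pair-counting form over the contingency table of the two partitions (the index, its expected value and its maximum are all expressed as counts of pairs). The function name and the toy labels are assumptions for the example; this is not TunUp's implementation.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_x, labels_y):
    n = len(labels_x)
    # Pairs of elements placed together in both partitions.
    index = sum(comb(c, 2) for c in Counter(zip(labels_x, labels_y)).values())
    # Pairs placed together in X, and in Y, taken separately.
    pairs_x = sum(comb(c, 2) for c in Counter(labels_x).values())
    pairs_y = sum(comb(c, 2) for c in Counter(labels_y).values())
    expected = pairs_x * pairs_y / comb(n, 2)
    max_index = (pairs_x + pairs_y) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (index - expected) / (max_index - expected)

truth = [0, 0, 1, 1]
same_up_to_renaming = adjusted_rand(truth, [1, 1, 0, 0])
crossed = adjusted_rand(truth, [0, 1, 0, 1])
```

The score is 1 for identical partitions regardless of how the clusters are named, and it can go below 0 when the agreement is worse than expected by chance.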
Correlation t-test
Pearson correlation over a set of 120 random k-means configuration evaluations:

Average correlations:

AIC: 0.77
Dunn: 0.49
Davies-Bouldin: 0.51
Silhouette: 0.49
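
The validation step can be sketched as follows: compute the Pearson correlation between the scores of an internal criterion and the AdjustedRand scores over the same configurations, then turn it into a t statistic with n - 2 degrees of freedom. The exact test procedure used in TunUp is not shown on the slide, so the pairing of the two score lists and the function names are assumptions.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    # Test statistic for H0: no linear correlation; requires |r| < 1 and n > 2.
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Toy score lists standing in for (internal criterion, AdjustedRand) pairs.
internal_scores = [0.1, 0.4, 0.35, 0.8, 0.7]
reference_scores = [0.2, 0.5, 0.3, 0.9, 0.6]
r = pearson(internal_scores, reference_scores)
t = t_statistic(r, len(internal_scores))
```

The t value is then compared against the Student's t distribution to decide whether the correlation is significant.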
Dataset
D31: 3100 vectors, 2 dimensions, 31 clusters
S1: 5000 vectors, 2 dimensions, 15 clusters

Source: http://cs.joensuu.fi/sipu/datasets/
Initial Centroids issue
N. observations = 200
Input configuration: k = 31, Distance Measure = Euclidean

[Figure panels: AdjustedRand, AIC]

We can consider the median value!
Full space evaluation
N executions averaged = 20

Global optimum at:
K = 36
DistanceMeasure = Euclidean
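
A full-space evaluation amounts to an exhaustive grid search. The sketch below assumes a hypothetical `evaluate(k, measure, seed)` callback that runs one clustering and returns a score where higher is better; averaging over repeated runs smooths out the effect of the random initial centroids. With K in 2..40 and the nine listed measures, such a search would cover 39 x 9 = 351 configurations, or 7,020 runs at 20 executions each.

```python
import itertools
import statistics

def full_space_search(ks, measures, evaluate, repetitions=20):
    """Return (mean score, k, measure) for the best configuration."""
    best = None
    for k, measure in itertools.product(ks, measures):
        # Repeat each configuration with different seeds and average the scores.
        scores = [evaluate(k, measure, seed) for seed in range(repetitions)]
        score = statistics.mean(scores)
        if best is None or score > best[0]:
            best = (score, k, measure)
    return best

# Toy stand-in for a real clustering evaluation (hypothetical scoring).
def toy_evaluate(k, measure, seed):
    return -(k - 36) ** 2 + (1 if measure == "Euclidean" else 0)

best = full_space_search(range(2, 41), ["Euclidean", "Manhattan"], toy_evaluate)
```

The cost grows with the product of all parameter ranges, which is the motivation for a cheaper search strategy such as a genetic algorithm.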
Genetic Algorithm Tuning
Selection: elitism + roulette wheel

Crossover:
    parents:   [x1,x2,x3,x4,...,xm]
               [y1,y2,y3,y4,...,ym]
    offspring: [x1,x2,x3,y4,...,ym]
               [y1,y2,y3,x4,...,xm]

Mutation:
    $$\Pr(\text{mutate } k_i \to k_j) \propto \frac{1}{\text{distance}(k_i, k_j)}$$
    $$\Pr(\text{mutate } d_i \to d_j) = \frac{1}{N_{dist} - 1}$$
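
The two operators can be sketched as below. The crossover is the single-point exchange shown in the example vectors; the mutation of k assumes that distance(k_i, k_j) is the absolute difference |k_i - k_j|, which the slide does not specify, and the mutation of the distance measure picks uniformly among the other measures. All function names are illustrative and are not part of Watchmaker or TunUp.

```python
import random

def single_point_crossover(x, y, point):
    # Swap the tails of the two parents after the crossover point.
    return x[:point] + y[point:], y[:point] + x[point:]

def k_mutation_probabilities(k, candidates):
    # Probability of mutating k into each other value, proportional to
    # 1 / |k - candidate| (assumed distance), normalized to sum to 1.
    weights = {c: 1.0 / abs(c - k) for c in candidates if c != k}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def mutate_k(k, candidates, rng):
    probs = k_mutation_probabilities(k, candidates)
    values = list(probs)
    return rng.choices(values, weights=[probs[v] for v in values], k=1)[0]

def mutate_distance(d, measures, rng):
    # Uniform choice among the N_dist - 1 other measures.
    return rng.choice([m for m in measures if m != d])

rng = random.Random(0)
children = single_point_crossover([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], 3)
probs = k_mutation_probabilities(10, range(2, 41))
new_k = mutate_k(10, range(2, 41), rng)
new_d = mutate_distance("Euclidean", ["Euclidean", "Manhattan"], rng)
```

With this weighting a mutation is more likely to move k to a nearby value than to a distant one, so the search explores locally while still allowing occasional long jumps.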
Tuning parameters:
Fitness evaluation: AIC
Prob. mutation: 0.5
Prob. crossover: 0.9
Population size: 6
Stagnation limit: 5
Elitism: 1
N executions averaged: 10
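
The parameters above map onto an evolution loop roughly as follows. This is a simplified sketch and not the Watchmaker engine: the fitness is treated as a cost to minimize (as with AIC), roulette-wheel weights are obtained by subtracting each cost from the worst one, and the toy cost, crossover and mutation functions are hypothetical stand-ins for a real clustering evaluation.

```python
import random

def roulette_select(candidates, costs, rng):
    # Lower cost means a larger slice of the wheel; the worst candidate
    # keeps a tiny non-zero weight so the weights never sum to zero.
    worst = max(costs)
    weights = [worst - c + 1e-9 for c in costs]
    return rng.choices(candidates, weights=weights, k=1)[0]

def evolve(population, cost_fn, crossover, mutate, rng, elitism=1,
           p_crossover=0.9, p_mutation=0.5, stagnation_limit=5):
    best, best_cost, stagnant = None, float("inf"), 0
    # Stop after `stagnation_limit` generations without improvement.
    while stagnant < stagnation_limit:
        costs = [cost_fn(c) for c in population]
        order = sorted(range(len(population)), key=lambda i: costs[i])
        if costs[order[0]] < best_cost:
            best, best_cost, stagnant = population[order[0]], costs[order[0]], 0
        else:
            stagnant += 1
        # Elitism: the best candidates survive unchanged.
        next_gen = [population[i] for i in order[:elitism]]
        while len(next_gen) < len(population):
            a = roulette_select(population, costs, rng)
            b = roulette_select(population, costs, rng)
            if rng.random() < p_crossover:
                a, b = crossover(a, b)
            for child in (a, b):
                if rng.random() < p_mutation:
                    child = mutate(child, rng)
                next_gen.append(child)
        population = next_gen[:len(population)]
    return best, best_cost

MEASURES = ["Euclidean", "Manhattan", "Chebyshev"]

def toy_cost(c):  # hypothetical cost with its minimum at (36, "Euclidean")
    return abs(c[0] - 36) + (0 if c[1] == "Euclidean" else 1)

def toy_crossover(a, b):  # exchange the distance-measure gene
    return (a[0], b[1]), (b[0], a[1])

def toy_mutate(c, rng):  # move k by one step, staying within 2..40
    return (min(40, max(2, c[0] + rng.choice([-1, 1]))), c[1])

rng = random.Random(42)
start = [(rng.randint(2, 40), rng.choice(MEASURES)) for _ in range(6)]
best, best_cost = evolve(start, toy_cost, toy_crossover, toy_mutate, rng)
```

Because the elite candidate is always carried over, the best cost can never get worse from one generation to the next.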




    Relevant results:
➢ The best fitness value always decreases
➢ The mean fitness value shows a decreasing trend
➢ A high standard deviation in one population often yields a better
  mean fitness in the next one
Results

Test1:
k = 39, Distance Measure = Manhattan

Test2:
k = 33, Distance Measure = RBF Kernel

Test3:
k = 36, Distance Measure = Euclidean




Different results due to:
1. Early convergence
2. Random initial centroids
Parallel GA
Simulation: 10 evolutions, POP_SIZE = 5, no elitism
Amazon Elastic Compute Cloud (EC2): 10 x Micro instances

Optimal n. of servers = POP_SIZE - ELITISM

E[T single evolution] ≤
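
One plausible reading of the server formula is that elite candidates are carried over with an already known fitness, so only POP_SIZE - ELITISM candidates need a fresh evaluation per generation and each can go to its own server. The sketch below illustrates this with a local thread pool standing in for the EC2 instances; the cache, the function names and the toy cost are assumptions, not TunUp's REST-based implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def optimal_workers(pop_size, elitism):
    return pop_size - elitism

def evaluate_generation(candidates, cached_costs, cost_fn, elitism):
    """Evaluate uncached candidates in parallel and return all costs in order."""
    to_evaluate = [c for c in candidates if c not in cached_costs]
    workers = max(1, optimal_workers(len(candidates), elitism))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per candidate that still needs a fitness value.
        results = list(pool.map(cost_fn, to_evaluate))
    cached_costs.update(zip(to_evaluate, results))
    return [cached_costs[c] for c in candidates]

evaluated = []

def toy_cost(candidate):  # hypothetical stand-in for a remote clustering run
    evaluated.append(candidate)
    return float(abs(candidate[0] - 36))

generation = [(36, "Euclidean"), (10, "Manhattan"), (20, "Cosine")]
cache = {(36, "Euclidean"): 0.0}  # the elite candidate from the previous generation
costs = evaluate_generation(generation, cache, toy_cost, elitism=1)
```

Adding more workers than there are candidates to evaluate brings no further speed-up, since each generation must finish before the next one starts.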
Conclusions
We developed, tested and analysed TunUp, an open-source solution for the
evaluation, validation and tuning of data clustering algorithms.

Future applications:
➢ Tuning of existing algorithms
➢ Supporting the design of new algorithms
➢ Evaluation and comparison of different algorithms

Limitations:
➢ Single distance measure
➢ Equal normalization
➢ Master/slave parallel execution
➢ Random initial centroids
Questions?
Thank you! Tack! Grazie!

Contenu connexe

Tendances

Support Vector Machines Simply
Support Vector Machines SimplySupport Vector Machines Simply
Support Vector Machines SimplyEmad Nabil
 
K-Means Clustering Simply
K-Means Clustering SimplyK-Means Clustering Simply
K-Means Clustering SimplyEmad Nabil
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford MapR Technologies
 
MLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackMLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackarogozhnikov
 
Dueling network architectures for deep reinforcement learning
Dueling network architectures for deep reinforcement learningDueling network architectures for deep reinforcement learning
Dueling network architectures for deep reinforcement learningTaehoon Kim
 
K means clustering
K means clusteringK means clustering
K means clusteringAhmedasbasb
 
[Vldb 2013] skyline operator on anti correlated distributions
[Vldb 2013] skyline operator on anti correlated distributions[Vldb 2013] skyline operator on anti correlated distributions
[Vldb 2013] skyline operator on anti correlated distributionsWooSung Choi
 
Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEParogozhnikov
 
Time series clustering presentation
Time series clustering presentationTime series clustering presentation
Time series clustering presentationEleni Stamatelou
 
learned optimizer.pptx
learned optimizer.pptxlearned optimizer.pptx
learned optimizer.pptxQingsong Guo
 
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...Atsushi Nitanda
 
Gan seminar
Gan seminarGan seminar
Gan seminarSan Kim
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Taehoon Kim
 
Design and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation AlgorithmsDesign and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation AlgorithmsAjay Bidyarthy
 
Matrix decomposition and_applications_to_nlp
Matrix decomposition and_applications_to_nlpMatrix decomposition and_applications_to_nlp
Matrix decomposition and_applications_to_nlpankit_ppt
 
8 ijaems jan-2016-20-multi-attribute group decision making of internet public...
8 ijaems jan-2016-20-multi-attribute group decision making of internet public...8 ijaems jan-2016-20-multi-attribute group decision making of internet public...
8 ijaems jan-2016-20-multi-attribute group decision making of internet public...INFOGAIN PUBLICATION
 

Tendances (18)

Support Vector Machines Simply
Support Vector Machines SimplySupport Vector Machines Simply
Support Vector Machines Simply
 
K-Means Clustering Simply
K-Means Clustering SimplyK-Means Clustering Simply
K-Means Clustering Simply
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
MLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackMLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic track
 
Dueling network architectures for deep reinforcement learning
Dueling network architectures for deep reinforcement learningDueling network architectures for deep reinforcement learning
Dueling network architectures for deep reinforcement learning
 
K means clustering
K means clusteringK means clustering
K means clustering
 
Rough K Means - Numerical Example
Rough K Means - Numerical ExampleRough K Means - Numerical Example
Rough K Means - Numerical Example
 
[Vldb 2013] skyline operator on anti correlated distributions
[Vldb 2013] skyline operator on anti correlated distributions[Vldb 2013] skyline operator on anti correlated distributions
[Vldb 2013] skyline operator on anti correlated distributions
 
Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEP
 
11 clusadvanced
11 clusadvanced11 clusadvanced
11 clusadvanced
 
Time series clustering presentation
Time series clustering presentationTime series clustering presentation
Time series clustering presentation
 
learned optimizer.pptx
learned optimizer.pptxlearned optimizer.pptx
learned optimizer.pptx
 
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
 
Gan seminar
Gan seminarGan seminar
Gan seminar
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)
 
Design and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation AlgorithmsDesign and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation Algorithms
 
Matrix decomposition and_applications_to_nlp
Matrix decomposition and_applications_to_nlpMatrix decomposition and_applications_to_nlp
Matrix decomposition and_applications_to_nlp
 
8 ijaems jan-2016-20-multi-attribute group decision making of internet public...
8 ijaems jan-2016-20-multi-attribute group decision making of internet public...8 ijaems jan-2016-20-multi-attribute group decision making of internet public...
8 ijaems jan-2016-20-multi-attribute group decision making of internet public...
 

En vedette

The Beethoven Frieze
The Beethoven FriezeThe Beethoven Frieze
The Beethoven Friezeguimera
 
Fund Raising: A Ladder for Corporate GrowthFund raising
Fund Raising: A Ladder for Corporate GrowthFund raisingFund Raising: A Ladder for Corporate GrowthFund raising
Fund Raising: A Ladder for Corporate GrowthFund raisingPavan Kumar Vijay
 
Make your team less hierarchical
Make your team less hierarchicalMake your team less hierarchical
Make your team less hierarchicalPaolo Venerucci
 
TouchID, Handoff, Spotlight oraz Multitasking: Nowości W Projektowaniu Interf...
TouchID, Handoff, Spotlight oraz Multitasking: Nowości W Projektowaniu Interf...TouchID, Handoff, Spotlight oraz Multitasking: Nowości W Projektowaniu Interf...
TouchID, Handoff, Spotlight oraz Multitasking: Nowości W Projektowaniu Interf...Maciej Kołek
 
A short history of drug use according to Pete
A short history of drug use according to PeteA short history of drug use according to Pete
A short history of drug use according to PetePeteLees
 
Basics of the Federal Deposit Insurance Corporation
Basics of the Federal Deposit Insurance CorporationBasics of the Federal Deposit Insurance Corporation
Basics of the Federal Deposit Insurance CorporationGlobal Client Solutions
 
Needle Founders & Culture code
Needle Founders & Culture code Needle Founders & Culture code
Needle Founders & Culture code Rupam Gogoi
 
Michael Gage SOED 2016
Michael Gage SOED 2016Michael Gage SOED 2016
Michael Gage SOED 2016Colleen Ganley
 
Yahya Almalki SOED 2016
Yahya Almalki SOED 2016Yahya Almalki SOED 2016
Yahya Almalki SOED 2016Colleen Ganley
 
Reputation – A Critical Driver of Business Value, by Ian Wright MPRCA, Corpor...
Reputation – A Critical Driver of Business Value, by Ian Wright MPRCA, Corpor...Reputation – A Critical Driver of Business Value, by Ian Wright MPRCA, Corpor...
Reputation – A Critical Driver of Business Value, by Ian Wright MPRCA, Corpor...Mattcartmell
 

En vedette (14)

The Beethoven Frieze
The Beethoven FriezeThe Beethoven Frieze
The Beethoven Frieze
 
Fund Raising: A Ladder for Corporate GrowthFund raising
Fund Raising: A Ladder for Corporate GrowthFund raisingFund Raising: A Ladder for Corporate GrowthFund raising
Fund Raising: A Ladder for Corporate GrowthFund raising
 
Make your team less hierarchical
Make your team less hierarchicalMake your team less hierarchical
Make your team less hierarchical
 
Lamb day
Lamb dayLamb day
Lamb day
 
306 - Lesson 1 - History of Comics
306 - Lesson 1 - History of Comics306 - Lesson 1 - History of Comics
306 - Lesson 1 - History of Comics
 
TouchID, Handoff, Spotlight oraz Multitasking: Nowości W Projektowaniu Interf...
TouchID, Handoff, Spotlight oraz Multitasking: Nowości W Projektowaniu Interf...TouchID, Handoff, Spotlight oraz Multitasking: Nowości W Projektowaniu Interf...
TouchID, Handoff, Spotlight oraz Multitasking: Nowości W Projektowaniu Interf...
 
A short history of drug use according to Pete
A short history of drug use according to PeteA short history of drug use according to Pete
A short history of drug use according to Pete
 
Basics of the Federal Deposit Insurance Corporation
Basics of the Federal Deposit Insurance CorporationBasics of the Federal Deposit Insurance Corporation
Basics of the Federal Deposit Insurance Corporation
 
Needle Founders & Culture code
Needle Founders & Culture code Needle Founders & Culture code
Needle Founders & Culture code
 
Michael Gage SOED 2016
Michael Gage SOED 2016Michael Gage SOED 2016
Michael Gage SOED 2016
 
Yahya Almalki SOED 2016
Yahya Almalki SOED 2016Yahya Almalki SOED 2016
Yahya Almalki SOED 2016
 
Jacob von Uexkull
Jacob von UexkullJacob von Uexkull
Jacob von Uexkull
 
Reputation – A Critical Driver of Business Value, by Ian Wright MPRCA, Corpor...
Reputation – A Critical Driver of Business Value, by Ian Wright MPRCA, Corpor...Reputation – A Critical Driver of Business Value, by Ian Wright MPRCA, Corpor...
Reputation – A Critical Driver of Business Value, by Ian Wright MPRCA, Corpor...
 
Crowdfunding: wie niet vraagt, niet wint
Crowdfunding: wie niet vraagt, niet wintCrowdfunding: wie niet vraagt, niet wint
Crowdfunding: wie niet vraagt, niet wint
 

Similaire à TunUp final presentation

MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1arogozhnikov
 
11ClusAdvanced.ppt
11ClusAdvanced.ppt11ClusAdvanced.ppt
11ClusAdvanced.pptSueMiu
 
Chapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.pptChapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.pptSubrata Kumer Paul
 
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...Salah Amean
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Daniel Chan
 
Anomaly detection using deep one class classifier
Anomaly detection using deep one class classifierAnomaly detection using deep one class classifier
Anomaly detection using deep one class classifier홍배 김
 
5 DimensionalityReduction.pdf
5 DimensionalityReduction.pdf5 DimensionalityReduction.pdf
5 DimensionalityReduction.pdfRahul926331
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applicationsFrank Nielsen
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习AdaboostShocky1
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modeljins0618
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetAlaaZ
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function홍배 김
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reductionAaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reductionAminaRepo
 

Similaire à TunUp final presentation (20)

Lect4
Lect4Lect4
Lect4
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1
 
11ClusAdvanced.ppt
11ClusAdvanced.ppt11ClusAdvanced.ppt
11ClusAdvanced.ppt
 
Chapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.pptChapter 11. Cluster Analysis Advanced Methods.ppt
Chapter 11. Cluster Analysis Advanced Methods.ppt
 
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
Data Mining: Concepts and techniques: Chapter 11,Review: Basic Cluster Analys...
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
 
Anomaly detection using deep one class classifier
Anomaly detection using deep one class classifierAnomaly detection using deep one class classifier
Anomaly detection using deep one class classifier
 
5 DimensionalityReduction.pdf
5 DimensionalityReduction.pdf5 DimensionalityReduction.pdf
5 DimensionalityReduction.pdf
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
 
Interactive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social GraphsInteractive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social Graphs
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习Adaboost
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture model
 
Knn 160904075605-converted
Knn 160904075605-convertedKnn 160904075605-converted
Knn 160904075605-converted
 
Data analysis of weather forecasting
Data analysis of weather forecastingData analysis of weather forecasting
Data analysis of weather forecasting
 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial Dataset
 
Project PPT
Project PPTProject PPT
Project PPT
 
Cs345 cl
Cs345 clCs345 cl
Cs345 cl
 
The world of loss function
The world of loss functionThe world of loss function
The world of loss function
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reductionAaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
 

Plus de Gianmario Spacagna

Latent Panelists Affinities: a Helixa case study
Latent Panelists Affinities: a Helixa case studyLatent Panelists Affinities: a Helixa case study
Latent Panelists Affinities: a Helixa case studyGianmario Spacagna
 
Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsGianmario Spacagna
 
Managers guide to effective building of machine learning products
Managers guide to effective building of machine learning productsManagers guide to effective building of machine learning products
Managers guide to effective building of machine learning productsGianmario Spacagna
 
Anomaly Detection using Deep Auto-Encoders
Anomaly Detection using Deep Auto-EncodersAnomaly Detection using Deep Auto-Encoders
Anomaly Detection using Deep Auto-EncodersGianmario Spacagna
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...Gianmario Spacagna
 
Logical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetupLogical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetupGianmario Spacagna
 
Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...Gianmario Spacagna
 
Parallel Tuning of Machine Learning Algorithms, Thesis Proposal
Parallel Tuning of Machine Learning Algorithms, Thesis ProposalParallel Tuning of Machine Learning Algorithms, Thesis Proposal
Parallel Tuning of Machine Learning Algorithms, Thesis ProposalGianmario Spacagna
 

Plus de Gianmario Spacagna (8)

Latent Panelists Affinities: a Helixa case study
Latent Panelists Affinities: a Helixa case studyLatent Panelists Affinities: a Helixa case study
Latent Panelists Affinities: a Helixa case study
 
Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning products
 
Managers guide to effective building of machine learning products
Managers guide to effective building of machine learning productsManagers guide to effective building of machine learning products
Managers guide to effective building of machine learning products
 
Anomaly Detection using Deep Auto-Encoders
Anomaly Detection using Deep Auto-EncodersAnomaly Detection using Deep Auto-Encoders
Anomaly Detection using Deep Auto-Encoders
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
Logical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetupLogical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetup
 
Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...Robust and declarative machine learning pipelines for predictive buying at Ba...
Robust and declarative machine learning pipelines for predictive buying at Ba...
 
Parallel Tuning of Machine Learning Algorithms, Thesis Proposal
Parallel Tuning of Machine Learning Algorithms, Thesis ProposalParallel Tuning of Machine Learning Algorithms, Thesis Proposal
Parallel Tuning of Machine Learning Algorithms, Thesis Proposal
 

Dernier

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SAP Build Work Zone - Overview L2-L3.pptx


TunUp final presentation

  • 1. TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering Gianmario Spacagna gm.spacagna@gmail.com March 2013 AgilOne, Inc. 1091 N Shoreline Blvd. #250 Mountain View, CA 94043
  • 2. Agenda 1. Introduction 2. Problem description 3. TunUp 4. K-means 5. Clustering evaluation 6. Full space tuning 7. Genetic algorithm tuning 8. Conclusions
  • 4. Business Intelligence Why? Where? What? How? Insights into customers, products and companies. Can someone else know your customer better than you? Do you have the domain knowledge and proper computation infrastructure?
  • 5. Big Data as a Service (BDaaS)
  • 6. Problem Description income cost customers
  • 7. Tuning of Clustering Algorithms We need tuning when: ➢ A new algorithm or version is released ➢ We want to improve accuracy and/or performance ➢ A new customer arrives and the system must be adapted to the new dataset and requirements
  • 8. TunUp Java framework integrating JavaML and Watchmaker Main features: ➢ Data manipulation (loading, labelling and normalization) ➢ Clustering algorithms (k-means) ➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand) ➢ Evaluation techniques validation (Pearson Correlation t-test) ➢ Full search space tuning ➢ Genetic Algorithm tuning (local and parallel implementation) ➢ RESTful API for web service deployment (Tomcat in Amazon EC2) Open-source: http://github.com/gm-spacagna/tunup
  • 9. k-means Geometric hard-assignment clustering algorithm: it partitions n data points into k clusters such that each point belongs to the cluster with the nearest centroid (the cluster mean). Given k clusters $S = \{S_1, \dots, S_k\}$, where $x_j$ is the j-th point of a cluster and $\mu_i$ is the centroid of cluster $S_i$, the goal of k-means is to minimize the Within-Cluster Sum of Squares: $\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$. Algorithm: 1. Initialization: a set of k random centroids is generated 2. Assignment: each point is assigned to the closest centroid 3. Update: the new centroids are calculated as the means of the new clusters 4. Go to 2 until convergence (the centroids are stable and no longer change)
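The four steps on this slide can be sketched in plain Java. This is a minimal illustration rather than TunUp's or JavaML's implementation: the class and method names are hypothetical, only Euclidean distance is used, and the initial centroids are drawn from the data points themselves (an assumption, since the slide only says they are random).

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal k-means sketch (Euclidean only); names are illustrative, not TunUp's API. */
final class KMeansSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Mean of each cluster; an empty cluster keeps its previous centroid. */
    static double[][] means(double[][] points, int[] assignment, double[][] previous) {
        int k = previous.length;
        int dims = points[0].length;
        double[][] sums = new double[k][dims];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            counts[assignment[i]]++;
            for (int d = 0; d < dims; d++) {
                sums[assignment[i]][d] += points[i][d];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dims; d++) {
                sums[c][d] = counts[c] == 0 ? previous[c][d] : sums[c][d] / counts[c];
            }
        }
        return sums;
    }

    /** Runs the four steps and returns the cluster index of every point (requires k <= n). */
    static int[] cluster(double[][] points, int k, int maxIterations, long seed) {
        // 1. Initialization: k data points with distinct indices become the first centroids.
        Random random = new Random(seed);
        int[] order = new int[points.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            int pick = c + random.nextInt(order.length - c);
            int tmp = order[c];
            order[c] = order[pick];
            order[pick] = tmp;
            centroids[c] = points[order[c]].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // 2. Assignment: each point goes to its closest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[i], centroids[c])
                            < squaredDistance(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            // 4. Convergence: stop once no point changes cluster.
            if (!changed) {
                break;
            }
            // 3. Update: centroids become the means of the new clusters.
            centroids = means(points, assignment, centroids);
        }
        return assignment;
    }

    /** Within-Cluster Sum of Squares of a finished assignment. */
    static double wcss(double[][] points, int[] assignment, int k) {
        double[][] centroids = means(points, assignment, new double[k][points[0].length]);
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            total += squaredDistance(points[i], centroids[assignment[i]]);
        }
        return total;
    }
}
```

Calling `KMeansSketch.cluster(points, 31, 20, seed)` with different seeds gives different assignments, which is the initial-centroid sensitivity discussed later in the deck.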
  • 10. k-means tuning Input parameters required: 1. K = (2,...,40) 2. Distance measure 3. Max iterations = 20 (fixed). Distance measures: 0. Angular 2. Chebyshev 3. Cosine 4. Euclidean 5. Jaccard Index 6. Manhattan 7. Pearson Correlation Coefficient 8. Radial Basis Function Kernel 9. Spearman Footrule. Different input parameters produce very different outcomes!
  • 11. Clustering Evaluation Definition of cluster: “A group of the same or similar elements gathered or occurring closely together” How do we evaluate if a set of clusters is good or not? “Clustering is in the eye of the beholder” [E. Castro, 2002] Two main categories: ➢ Internal criterion : only based on the clustered data itself ➢ External criterion : based on benchmarks of pre-classified items
  • 12. Internal Evaluation The common goal is to assign better scores when: ➢ Intra-cluster similarity is high ➢ Inter-cluster similarity is low. The choice of the evaluation technique depends on the nature of the data and the cluster model of the algorithm. Cluster models: ➢ Distance-based (k-means) ➢ Density-based (DBSCAN) ➢ Distribution-based (EM clustering) ➢ Connectivity-based (linkage clustering)
  • 13. Proposed techniques AIC: measure of the relative amount of information lost by a statistical model. The clustering algorithm is modelled as a Gaussian Mixture Process. (inverted function) Dunn: ratio between the minimum inter-cluster distance and the maximum cluster diameter. (natural function) Davies-Bouldin: average similarity between each cluster and its most similar one. (inverted function) Silhouette: measure of how well each point lies within its cluster; it indicates whether the object is correctly clustered or would fit better in the neighbouring cluster. (natural function)
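As an illustration of one of these internal criteria, the sketch below computes a Dunn index. It assumes one common variant of the definition (inter-cluster distance as the smallest distance between points of different clusters, diameter as the largest distance between points of the same cluster) and Euclidean distance; the class name is hypothetical and this is not claimed to be the formulation used in TunUp.

```java
/** Dunn index sketch: min inter-cluster distance divided by max cluster diameter. */
final class DunnIndexSketch {

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Higher is better ("natural" function); infinite if every cluster has zero diameter. */
    static double dunn(double[][] points, int[] labels) {
        double minInter = Double.POSITIVE_INFINITY;
        double maxDiameter = 0.0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                double d = distance(points[i], points[j]);
                if (labels[i] == labels[j]) {
                    maxDiameter = Math.max(maxDiameter, d); // pair inside one cluster
                } else {
                    minInter = Math.min(minInter, d);       // pair across two clusters
                }
            }
        }
        return minInter / maxDiameter;
    }
}
```

Compact, well-separated clusters give a large ratio, while a labelling that mixes distant points into the same cluster drives it towards zero.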
  • 14. External criterion: AdjustedRand Given a set of n elements $S = \{o_1, \dots, o_n\}$ and two partitions to compare, $X = \{X_1, \dots, X_r\}$ and $Y = \{Y_1, \dots, Y_s\}$: $\text{RandIndex} = \frac{\text{number of agreements between } X \text{ and } Y}{\text{total number of possible pair combinations}}$ and $\text{AdjustedRandIndex} = \frac{\text{RandIndex} - \text{ExpectedIndex}}{\text{MaxIndex} - \text{ExpectedIndex}}$. We can use AdjustedRand as the reference for the best clustering evaluation and use it to validate the internal criteria.
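A possible way to compute the Adjusted Rand Index is through the contingency table of the two partitions, using the standard pair-counting form of Index, ExpectedIndex and MaxIndex. The sketch below assumes labels are encoded as non-negative integers; the class name is hypothetical, and returning 1.0 for the degenerate case where the denominator is zero is a convention chosen here.

```java
/** Adjusted Rand Index sketch based on the contingency table of two labelings. */
final class AdjustedRandSketch {

    /** Number of unordered pairs that can be drawn from n items. */
    static double choose2(long n) {
        return n * (n - 1) / 2.0;
    }

    static int maxLabel(int[] labels) {
        int max = 0;
        for (int label : labels) {
            max = Math.max(max, label);
        }
        return max;
    }

    /** x[i] and y[i] are the cluster labels (0, 1, 2, ...) of element i in each partition. */
    static double adjustedRand(int[] x, int[] y) {
        int r = maxLabel(x) + 1;
        int s = maxLabel(y) + 1;
        long[][] table = new long[r][s];
        long[] rowSums = new long[r];
        long[] colSums = new long[s];
        for (int i = 0; i < x.length; i++) {
            table[x[i]][y[i]]++;
            rowSums[x[i]]++;
            colSums[y[i]]++;
        }
        double index = 0.0;   // pairs placed together in both partitions
        double sumRows = 0.0; // pairs placed together in X
        double sumCols = 0.0; // pairs placed together in Y
        for (int a = 0; a < r; a++) {
            sumRows += choose2(rowSums[a]);
            for (int b = 0; b < s; b++) {
                index += choose2(table[a][b]);
            }
        }
        for (int b = 0; b < s; b++) {
            sumCols += choose2(colSums[b]);
        }
        double expectedIndex = sumRows * sumCols / choose2(x.length);
        double maxIndex = 0.5 * (sumRows + sumCols);
        if (maxIndex == expectedIndex) {
            return 1.0; // degenerate case (e.g. a single cluster on both sides); convention
        }
        return (index - expectedIndex) / (maxIndex - expectedIndex);
    }
}
```

The score does not depend on how the clusters are numbered: `{0, 0, 1, 1}` compared with `{1, 1, 0, 0}` is a perfect match.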
  • 15. Correlation t-test Pearson correlation over a set of 120 random k-means configuration evaluations. Average correlations: AIC: 0.77, Dunn: 0.49, Davies-Bouldin: 0.51, Silhouette: 0.49
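The validation step can be sketched as a Pearson correlation between two series of scores, followed by the usual t statistic for a correlation coefficient, $t = r\sqrt{(n-2)/(1-r^2)}$ with $n - 2$ degrees of freedom. The class and method names are hypothetical, and the slide does not say exactly how TunUp runs the test, so this is only an outline of the standard computation.

```java
/** Pearson correlation and its t statistic; a sketch of the standard formulas. */
final class PearsonSketch {

    /** Sample Pearson correlation coefficient of two series of equal length. */
    static double correlation(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    /** t statistic for testing r against zero with n observations (n - 2 degrees of freedom). */
    static double tStatistic(double r, int n) {
        return r * Math.sqrt((n - 2) / (1.0 - r * r));
    }
}
```

With the slide's setting, `x` would hold an internal score (for example AIC) and `y` the matching AdjustedRand values over the 120 configurations, and `tStatistic(r, 120)` would then be compared against a critical value from a t table.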
  • 16. Dataset D31: 3100 vectors, 2 dimensions, 31 clusters. S1: 5000 vectors, 2 dimensions, 15 clusters. Source: http://cs.joensuu.fi/sipu/datasets/
  • 17. Initial Centroids issue N. observations = 200. Input configuration: k = 31, Distance Measure = Euclidean (plots: AdjustedRand, AIC). We can consider the median value!
  • 18. Full space evaluation N executions averaged = 20. The global optimum is at: K = 36, Distance Measure = Euclidean
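A full search of this kind amounts to a nested loop over every (k, distance measure) pair, averaging several runs per pair to smooth out the random initial centroids. In the sketch below, the `cost` callback stands in for "run k-means once and evaluate the result"; it is treated as a value to minimize, which is an assumption, and all names are hypothetical.

```java
import java.util.function.ToDoubleBiFunction;

/** Exhaustive search over (k, distance measure index); a sketch, not TunUp's code. */
final class FullSearchSketch {

    /**
     * Returns {bestK, bestDistanceIndex}: the pair with the lowest cost averaged over `runs`
     * executions. `cost` is a hypothetical stand-in for one clustering run plus its evaluation.
     */
    static int[] bestConfiguration(int minK, int maxK, int numDistances, int runs,
                                   ToDoubleBiFunction<Integer, Integer> cost) {
        int[] best = {minK, 0};
        double bestMean = Double.POSITIVE_INFINITY;
        for (int k = minK; k <= maxK; k++) {
            for (int distance = 0; distance < numDistances; distance++) {
                double total = 0.0;
                for (int run = 0; run < runs; run++) {
                    total += cost.applyAsDouble(k, distance);
                }
                double mean = total / runs;
                if (mean < bestMean) {
                    bestMean = mean;
                    best = new int[] {k, distance};
                }
            }
        }
        return best;
    }
}
```

The number of clustering runs grows as (number of k values) x (number of distance measures) x (runs per pair), which is the cost that motivates the genetic algorithm on the following slides.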
  • 19. Genetic Algorithm Tuning Selection: Elitism + Roulette wheel. Crossover: parents [x1,x2,x3,x4,...,xm] and [y1,y2,y3,y4,...,ym] produce children [x1,x2,x3,y4,...,ym] and [y1,y2,y3,x4,...,xm]. Mutation: $\Pr(\text{mutate } k_i \to k_j) \propto \frac{1}{\text{distance}(k_i, k_j)}$ and $\Pr(\text{mutate } d_i \to d_j) = \frac{1}{N_{dist} - 1}$
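The crossover and mutation rules on this slide could look roughly as follows in plain Java. The sketch does not use the Watchmaker API; the names are hypothetical, and it assumes that distance(k_i, k_j) is the absolute difference |k_i - k_j|, which the slide does not specify.

```java
/** Genetic operators from the slide, sketched without any GA library. */
final class GeneticOperatorsSketch {

    /** Single-point crossover: the two children swap everything from index `cut` onwards. */
    static int[][] crossover(int[] x, int[] y, int cut) {
        int[] childX = x.clone();
        int[] childY = y.clone();
        for (int i = cut; i < x.length; i++) {
            childX[i] = y[i];
            childY[i] = x[i];
        }
        return new int[][] {childX, childY};
    }

    /**
     * Probability of mutating k from `current` to each value in [minK, maxK], proportional to
     * 1 / |current - candidate| (assumed distance). Entry i corresponds to k = minK + i.
     */
    static double[] kMutationProbabilities(int current, int minK, int maxK) {
        double[] probabilities = new double[maxK - minK + 1];
        double total = 0.0;
        for (int k = minK; k <= maxK; k++) {
            if (k != current) {
                probabilities[k - minK] = 1.0 / Math.abs(k - current);
                total += probabilities[k - minK];
            }
        }
        for (int i = 0; i < probabilities.length; i++) {
            probabilities[i] /= total; // normalise so the probabilities sum to 1
        }
        return probabilities;
    }

    /** Uniform probability of switching to any one of the other distance measures. */
    static double distanceMutationProbability(int numDistances) {
        return 1.0 / (numDistances - 1);
    }
}
```

Under this assumption nearby values of k are favoured, so a mutation tends to explore the neighbourhood of the current k, while the distance measure, which has no natural ordering, is resampled uniformly.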
  • 20. Tuning parameters: Fitness evaluation: AIC, Prob. mutation: 0.5, Prob. crossover: 0.9, Population size: 6, Stagnation limit: 5, Elitism: 1, N executions averaged: 10. Relevant results: ➢ Best fitness value always decreasing ➢ Mean fitness value trend decreasing ➢ High standard deviation in the previous population often generates a better mean population in the next one
  • 21. Results Test1: k = 39, Distance Measure = Manhattan Test2: k = 33, Distance Measure = RBF Kernel Test3: k = 36, Distance Measure = Euclidean Different results due to: 1. Early convergence 2. Random initial centroids
  • 22. Parallel GA Simulation: Amazon Elastic Compute Cloud (EC2), 10 evolutions, POP_SIZE = 5, no elitism, 10 x Micro instances. Optimal n. of servers = POP_SIZE - ELITISM. E[T single evolution] ≤
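The idea behind the server count is that each non-elite candidate of a generation needs one independent fitness evaluation, so they can all run at the same time. The sketch below imitates this with a local thread pool standing in for the EC2 instances; it is not TunUp's master/slave implementation, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToDoubleFunction;

/** Parallel fitness evaluation sketch: worker threads play the role of remote servers. */
final class ParallelFitnessSketch {

    /** Evaluates every candidate on a pool of `workers` threads, preserving input order. */
    static <T> double[] evaluate(List<T> candidates, int workers, ToDoubleFunction<T> fitness) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (T candidate : candidates) {
                Callable<Double> task = () -> fitness.applyAsDouble(candidate);
                futures.add(pool.submit(task));
            }
            double[] scores = new double[candidates.size()];
            for (int i = 0; i < scores.length; i++) {
                scores[i] = futures.get(i).get(); // blocks until worker i has finished
            }
            return scores;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for fitness values", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fitness evaluation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With `workers` set to POP_SIZE - ELITISM, every candidate that needs re-evaluation gets its own worker, so one generation takes roughly as long as its slowest single evaluation plus coordination overhead.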
  • 23. Conclusions We developed, tested and analysed TunUp, an open-source solution for the evaluation, validation and tuning of data clustering algorithms. Future applications: ➢ Tuning of existing algorithms ➢ Supporting the design of new algorithms ➢ Evaluation and comparison of different algorithms. Limitations: ➢ Single distance measure ➢ Equal normalization ➢ Master / slave parallel execution ➢ Random initial centroids
  • 25. Thank you! Tack! Grazie!