Nearest-Neighbor-Based Approaches to Multivariate Data Analysis
Tim Hare
We can measure a multivariate item’s similarity to the other items (n) via its distance from those items in variable (p) space.
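A minimal sketch of the idea above, written in Python rather than SAS: each of the n items is a point in p-space, and similarity is read off a pairwise distance matrix. The function names and the three sample items are hypothetical, chosen only for illustration, and Euclidean distance is assumed.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two items measured on the same p variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(items):
    """n x n matrix of pairwise distances for n items in p-space."""
    return [[euclidean(a, b) for b in items] for a in items]

# Hypothetical data: n = 3 items, p = 2 variables.
items = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(items)
print(D[0][1])  # 5.0
print(D[0][2])  # 10.0
```

Smaller entries in D mean more similar items; the diagonal is zero and the matrix is symmetric.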
Nearest Neighbor Searching: locate the nearest multivariate neighbors in p-space
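As a hedged illustration of nearest neighbor searching, the sketch below does a brute-force scan in Python: for one item it computes the distance to every other item and keeps the smallest. The function name and the sample points are assumptions made for this example; faster tree-based searches exist but are not shown.

```python
import math

def nearest_neighbor(query_index, items):
    """Return (index, distance) of the item closest to items[query_index]."""
    best_index, best_dist = None, math.inf
    for j, other in enumerate(items):
        if j == query_index:
            continue  # an item is not its own neighbor
        d = math.dist(items[query_index], other)  # Euclidean distance
        if d < best_dist:
            best_index, best_dist = j, d
    return best_index, best_dist

# Hypothetical data: four items in p = 2 dimensions, forming two obvious pairs.
items = [(1.0, 1.0), (1.5, 1.2), (8.0, 9.0), (9.0, 9.5)]
print(nearest_neighbor(0, items)[0])  # 1
print(nearest_neighbor(2, items)[0])  # 3
```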
Clustering Approaches
Non-Hierarchical Divisive
Hierarchical Agglomerative Clustering
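The slide body did not survive extraction, so the following is only a generic sketch of hierarchical agglomerative clustering, not the deck's own procedure: start with every item as its own cluster, repeatedly merge the two closest clusters, and record each merge. Single linkage is assumed as the default, and all names and data are hypothetical.

```python
import math
from itertools import combinations

def single_link(a, b):
    """Smallest item-to-item distance between clusters a and b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(items, linkage=single_link):
    """Naive agglomerative clustering; returns [(merged_cluster, distance), ...]."""
    clusters = [[item] for item in items]  # every item starts alone
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        d = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        history.append((merged, d))
        # Replace the two merged clusters with their union; merges are never undone.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical one-variable data.
items = [(0.0,), (1.0,), (5.0,), (6.5,)]
history = agglomerate(items)
for merged, d in history:
    print(d, merged)
```

Because each merge is permanent, a poor early merge carries through every later level of the tree.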
Distance is not enough to deal with objects that have dimension themselves: “LINKAGE”
SINGLE vs COMPLETE linkage (PROC CLUSTER METHOD=SINGLE / METHOD=COMPLETE)
Single linkage: d(S,Q) = min d(s,q) over all items s in S and q in Q. Complete linkage: d(S,Q) = max d(s,q) over the same pairs.
CHAINING during single-linkage clustering: one of the few ways to delineate non-ellipsoidal clusters, but it can be misleading in that items on opposite ends of a cluster are likely to be quite different.
Figure panels: Single Linkages; Resulting Clusters.
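To make the min/max distinction concrete, here is a small Python sketch (not SAS) that computes both linkage distances between two clusters S and Q. The cluster contents are hypothetical and Euclidean distance is assumed.

```python
import math
from itertools import product

def single_linkage(S, Q):
    """Distance between the two closest items, one from each cluster."""
    return min(math.dist(s, q) for s, q in product(S, Q))

def complete_linkage(S, Q):
    """Distance between the two farthest items, one from each cluster."""
    return max(math.dist(s, q) for s, q in product(S, Q))

# Hypothetical clusters laid out along a line.
S = [(0.0, 0.0), (1.0, 0.0)]
Q = [(2.0, 0.0), (5.0, 0.0)]
print(single_linkage(S, Q))    # 1.0
print(complete_linkage(S, Q))  # 5.0
```

The gap between the two values is what chaining exploits: single linkage sees these clusters as close even though their far ends are five units apart.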
AVERAGE linkage (PROC CLUSTER METHOD=AVERAGE)
d(A,B) = [ d(A1,B1) + d(A1,B2) + d(A1,B3) + d(A2,B1) + d(A2,B2) + d(A2,B3) + d(A3,B1) + d(A3,B2) + d(A3,B3) ] / 9
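The nine-term average above can be sketched in Python for clusters of any size: sum every item-to-item distance and divide by the number of pairs. The three-item clusters below are hypothetical one-variable data picked so the arithmetic is easy to check by hand.

```python
import math

def average_linkage(A, B):
    """Mean of all len(A) * len(B) item-to-item distances between two clusters."""
    total = sum(math.dist(a, b) for a in A for b in B)
    return total / (len(A) * len(B))

# Hypothetical clusters with three items each, so nine pairwise distances.
A = [(0.0,), (1.0,), (2.0,)]
B = [(4.0,), (5.0,), (6.0,)]
print(average_linkage(A, B))  # 4.0
```

The nine distances here are 4, 5, 6, 3, 4, 5, 2, 3, 4, which sum to 36.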
Ward’s Method (PROC CLUSTER METHOD=WARD)
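The slide text for Ward's method was lost, so this is a sketch of the standard textbook criterion rather than the deck's wording: at each step, merge the pair of clusters whose union gives the smallest increase in the error sum of squares (ESS). The helper names and data are hypothetical, and the code is Python, not PROC CLUSTER.

```python
def centroid(cluster):
    """Variable-wise mean of the items in a cluster."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def ess(cluster):
    """Error sum of squares: squared distances from each item to the centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(item, c)) for item in cluster)

def ward_cost(a, b):
    """Increase in total ESS caused by merging clusters a and b."""
    return ess(a + b) - ess(a) - ess(b)

# Hypothetical clusters: C sits right next to A, B is far away.
A = [(0.0, 0.0), (2.0, 0.0)]
B = [(10.0, 0.0), (12.0, 0.0)]
C = [(3.0, 0.0)]
print(ward_cost(A, B))
print(ward_cost(A, C))
```

Ward's method would merge A with C first, because that merge adds far less to the ESS than merging A with B.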
SAS options for Data Normalization
PROC ACECLUS output from the Poverty data set (p=3): Q-Q plots to check multivariate normality (MVN), which Ward’s method needs, on the transformed variables (can1, can2, can3). Rq(can1)=0.951, Rq(can2)=0.981, Rq(can3)=0.976, where n=97 and the critical point RqCP=0.9895 at α=0.1. All three Rq values fall below the critical point, so normality is rejected for each variable at this level. A more thorough investigation would involve outlier detection and removal as well as testing data transformations (Box-Cox).
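For readers who want to see where an Rq value comes from, here is a hedged Python sketch. It assumes the usual Q-Q plot construction: sort the data, pair the j-th ordered value with the standard normal quantile at probability (j - 0.5)/n, and take the correlation of the pairs. The function name and sample data are hypothetical, and the critical point must still come from a published table for the given n and α.

```python
import math
from statistics import NormalDist, mean

def qq_correlation(values):
    """Correlation between ordered data and standard normal quantiles (Rq)."""
    n = len(values)
    x = sorted(values)
    # Normal quantiles at probability levels (j - 0.5) / n, j = 1..n.
    q = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
    xbar, qbar = mean(x), mean(q)
    num = sum((a - xbar) * (b - qbar) for a, b in zip(x, q))
    den = math.sqrt(sum((a - xbar) ** 2 for a in x) *
                    sum((b - qbar) ** 2 for b in q))
    return num / den

# Hypothetical sample with one gross outlier; Rq should be well below 1.
skewed = [1.0] * 9 + [100.0]
print(qq_correlation(skewed))
```

A value close to 1 supports normality; the deck's values of 0.951, 0.981, and 0.976 are each compared against the tabled critical point 0.9895.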
Minimal code needed for a cluster analysis
Generate a data set containing only the number of clusters we wish to examine, for use in plotting if needed.
Sampling proportion: try values from 0.01 to 0.5.
PROC TREE output: how many clusters do we think are appropriate? (The distance criterion and its value at the time of each merger are on the horizontal axis.) Panels: Ward’s; Average.
Pseudo-F Statistic Plot Interpretation
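The plot itself is not recoverable, but the statistic can be sketched. Assuming the usual definition, pseudo-F for k clusters on n items is [B/(k-1)] / [W/(n-k)], where W is the within-cluster sum of squares and B is the between-cluster sum of squares. The Python below uses hypothetical names and data.

```python
def ess(cluster):
    """Sum of squared distances from each item to the cluster centroid."""
    n = len(cluster)
    c = [sum(coord) / n for coord in zip(*cluster)]
    return sum(sum((x - m) ** 2 for x, m in zip(item, c)) for item in cluster)

def pseudo_f(clusters):
    """Pseudo-F: between-cluster mean square over within-cluster mean square."""
    all_items = [item for cluster in clusters for item in cluster]
    n, k = len(all_items), len(clusters)
    within = sum(ess(cluster) for cluster in clusters)
    between = ess(all_items) - within
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical one-variable data, partitioned well and then badly.
good = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
bad = [[(0.0,), (10.0,)], [(2.0,), (12.0,)]]
print(pseudo_f(good))  # 50.0
print(pseudo_f(bad))
```

Larger values indicate tighter, better-separated clusters, which is why peaks in the pseudo-F plot are read as candidate cluster counts.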
Pseudo-T2 Statistic Plot Interpretation
Comparison of CCC, Pseudo-F, and Pseudo-T2 under different clustering runs, varying distance, linkage, and normalization. If we didn’t have a low-dimensional variable set (p=3), it would be impossible to build a case on AVERAGE and SINGLE linkage.
Panels: Euclidean distance, AVG linkage, ACECLUS normalized (?); Ward linkage, ACECLUS normalized (what we want to see); Single linkage, ACECLUS normalized (?).
Birth Rate vs Death Rate
Notice the evidence for the known bias in Ward’s method toward equal numbers of observations per cluster, whereas AVG linkage allows some small clusters in the lower right. The Expected Maximum Likelihood (EML) method in PROC CLUSTER produces results similar to Ward’s method, but with a slight bias in the opposite direction, toward clusters of unequal sizes.
Panels: Ward linkage, ACECLUS norm; Euclidean dist, AVG linkage, ACECLUS norm.
Birth Rate vs Infant-Death Rate. Panels: Ward linkage, ACECLUS norm; Euclidean dist, AVG linkage, ACECLUS norm.
Death Rate vs Infant-Death Rate. Panels: Ward linkage, ACECLUS norm; Euclidean dist, AVG linkage, ACECLUS norm.
Lessons learned?
Q & A
Here’s an example of the risk of “bad” hierarchical agglomerative clustering early on: a small run on 8 items shows divergence in cluster membership. If the final cluster number were 4, we’d have different results from these two runs. Which would be best? A robust approach gives only slight differences in clustering, but bad approaches can produce significant differences that are never undone as hierarchical agglomerative clustering proceeds.
MVN and outlier sensitivity of Ward’s linkage: a test on a small 4-item sample shows the effect of clustering with ACECLUS normalization (left) and with no normalization (right) under Ward’s linkage method. The clustering is somewhat different.
METHOD=WARD in PROC CLUSTER (pp. 692-693, Johnson & Wichern)
Ward’s + Aceclus
We need a stopping criterion: what is the best number of clusters to use? We don’t want too few clusters and/or a rise in SPRSQ.
Plot annotations: large jump in SPRSQ; small increase in SPRSQ; intermediate increase in SPRSQ.
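One way to turn the "large jump" reading into something mechanical is sketched below in Python. It is a hypothetical heuristic for illustration, not a rule from SAS or from the deck: look for the merge where SPRSQ grows most relative to the previous merge, and stop just before it. The SPRSQ values are invented for the example.

```python
import math

# Hypothetical SPRSQ values, keyed by the number of clusters left after each merge.
sprsq = {6: 0.010, 5: 0.015, 4: 0.020, 3: 0.030, 2: 0.150, 1: 0.400}

def suggest_cluster_count(sprsq):
    """Return the cluster count just before the largest relative jump in SPRSQ."""
    ks = sorted(sprsq, reverse=True)  # many clusters first, e.g. 6, 5, ..., 1
    best_k, best_ratio = None, -math.inf
    for prev_k, k in zip(ks, ks[1:]):
        ratio = sprsq[k] / sprsq[prev_k]  # growth caused by going from prev_k to k
        if ratio > best_ratio:
            best_k, best_ratio = prev_k, ratio
    return best_k

print(suggest_cluster_count(sprsq))  # 3
```

In this made-up series the merge from 3 to 2 clusters is five times as costly as the one before it, so the sketch suggests keeping 3. Any such rule should be checked against the dendrogram and the other statistics.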
How to interpret the PROC CLUSTER raw output: the cluster NAME and PARENT columns can be interpreted as noted below.
Bulgaria + Czechoslovakia → C3
FormerEGermany + C3 → C2
Albania + C2 → C1
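The three merges above can be read as a tree. The Python sketch below stores each parent cluster with its two children and expands any cluster name into the original items it contains. The dictionary layout and function name are hypothetical and do not mirror the actual PROC CLUSTER output data set.

```python
# Merge history from the slide: parent cluster -> the two things that were joined.
merges = {
    "C3": ["Bulgaria", "Czechoslovakia"],
    "C2": ["FormerEGermany", "C3"],
    "C1": ["Albania", "C2"],
}

def members(name, merges):
    """Expand a cluster name into the list of original items it contains."""
    if name not in merges:
        return [name]  # an original item, not a cluster
    leaves = []
    for child in merges[name]:
        leaves.extend(members(child, merges))
    return leaves

print(members("C2", merges))  # ['FormerEGermany', 'Bulgaria', 'Czechoslovakia']
print(members("C1", merges))  # ['Albania', 'FormerEGermany', 'Bulgaria', 'Czechoslovakia']
```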
SPRSQ: SAS Cluster Output
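The slide body is missing, so as a hedged reminder of what the SPRSQ column measures: the semipartial R-squared of a merge is the increase in within-cluster sum of squares caused by that merge, divided by the total sum of squares of all items. The Python below is a sketch with hypothetical names and data.

```python
def ess(cluster):
    """Sum of squared distances from each item to the cluster centroid."""
    n = len(cluster)
    c = [sum(coord) / n for coord in zip(*cluster)]
    return sum(sum((x - m) ** 2 for x, m in zip(item, c)) for item in cluster)

def semipartial_r2(a, b, all_items):
    """Share of total variation lost by merging clusters a and b."""
    increase = ess(a + b) - ess(a) - ess(b)
    return increase / ess(all_items)

# Hypothetical data: two tight, well-separated clusters.
A = [(0.0,), (2.0,)]
B = [(10.0,), (12.0,)]
print(semipartial_r2(A, B, A + B))
```

Merging two tight, distant clusters gives a value near 1, which is the kind of large jump the stopping rule tries to avoid.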
How to interpret the PROC TREE raw output: focus on CLUSTER and CLUSTERNAME. The cluster 1 event forms CL3, the cluster 2 event adds FEG (FormerEGermany), and the cluster 3 event adds Albania.
Prior to clustering we’ll use PROC ACECLUS to generate normalized variables: Can1~BirthRate, Can2~DeathRate, Can3~InfantDeathRate
True distance measures between items are preferable in clustering but not always possible (e.g. binary variables)
Mahalanobis Distance
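Since the slide body is gone, here is a short sketch of the standard definition: the Mahalanobis distance between x and y is the square root of (x - y)' S^-1 (x - y), where S is the covariance matrix. To stay dependency-free, the Python below handles only p = 2 and inverts the 2 x 2 matrix by hand. The function name and inputs are hypothetical.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between two 2-variable items given a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy) +
         dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(q)

identity = ((1.0, 0.0), (0.0, 1.0))
stretched = ((4.0, 0.0), (0.0, 1.0))  # first variable has variance 4
print(mahalanobis_2d((0.0, 0.0), (3.0, 4.0), identity))   # 5.0
print(mahalanobis_2d((0.0, 0.0), (2.0, 0.0), stretched))  # 1.0
```

With an identity covariance the result equals Euclidean distance; with a larger variance on one variable, differences along that variable count for less.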
Minkowski Distance: d(x,y) = ( Σ_i |x_i − y_i|^m )^(1/m)
m=1: sum of absolute differences, or “City Block” distance
m=2: square root of the sum of squared differences, or Euclidean distance
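The two special cases can be checked with a few lines of Python. The function name and sample points are hypothetical.

```python
def minkowski(x, y, m):
    """Minkowski distance of order m between two items in p-space."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # 7.0 (City Block: 3 + 4)
print(minkowski(x, y, 2))  # 5.0 (Euclidean: sqrt(9 + 16))
```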
SAS CODE for Clustering

Statistical Clustering
