Statistical Clustering

Nearest-Neighbor-Based Approaches to Multivariate Data Analysis
Tim Hare

We can measure a multivariate item's similarity to the other items (n of them) by its distance from those items in variable space (p dimensions).

Nearest Neighbor Searching: locate the nearest multivariate neighbors in p-space
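The idea of locating the nearest neighbor in p-space can be sketched as a brute-force search. This is a minimal Python illustration, not the SAS procedure used in the slides; the data values and the function names `euclidean` and `nearest_neighbor` are hypothetical, and Euclidean distance is assumed as the distance measure.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two items in p-space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(items, query_index):
    """Return (index, distance) of the item closest to items[query_index]."""
    query = items[query_index]
    best_index, best_dist = None, math.inf
    for i, item in enumerate(items):
        if i == query_index:
            continue  # an item is not its own neighbor
        d = euclidean(query, item)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Hypothetical data: n = 4 items measured on p = 3 variables.
items = [(1.0, 2.0, 3.0), (1.5, 2.0, 3.0), (8.0, 9.0, 7.0), (8.0, 8.0, 7.0)]
nn_index, nn_dist = nearest_neighbor(items, 0)
```

The search compares the query item with every other item, so it costs n - 1 distance evaluations per query.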

Clustering Approaches

Non-Hierarchical Divisive

Hierarchical Agglomerative Clustering

Item-to-item distance is not enough once the objects being compared are clusters with extent of their own: we need a "LINKAGE" rule

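SINGLE vs COMPLETE linkage (PROC CLUSTER METHOD=SINGLE / METHOD=COMPLETE)
Single linkage: the distance between clusters S and Q is the minimum distance between an item of S and an item of Q.
Complete linkage: the distance between clusters S and Q is the maximum distance between an item of S and an item of Q.
CHAINING during single linkage clustering: one of the few ways to delineate non-ellipsoidal clusters, but it can mislead, because items at opposite ends of a chained cluster are likely to be quite different.
[Figure: resulting clusters under single linkage]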
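The two linkage rules differ only in whether the minimum or the maximum item-to-item distance is taken. A minimal Python sketch, assuming Euclidean distance and two small hypothetical clusters (the function names are illustrative, not SAS options):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two items."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of items, one from each cluster."""
    return min(euclidean(a, b) for a in cluster_a for b in cluster_b)

def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest pair of items, one from each cluster."""
    return max(euclidean(a, b) for a in cluster_a for b in cluster_b)

# Hypothetical clusters lying along a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)      # closest pair: (1,0) and (3,0)
d_complete = complete_linkage(cluster_a, cluster_b)  # farthest pair: (0,0) and (6,0)
```

Because single linkage looks only at the closest pair, a string of nearby items can pull two otherwise distant groups together, which is the chaining effect.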

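AVERAGE linkage (PROC CLUSTER METHOD=AVERAGE)
For two clusters A and B of three items each, the cluster distance is the mean of all nine item-to-item distances:
d(A,B) = [ d(A1,B1) + d(A1,B2) + d(A1,B3) + d(A2,B1) + d(A2,B2) + d(A2,B3) + d(A3,B1) + d(A3,B2) + d(A3,B3) ] / 9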
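The nine-term average can be checked with a few lines of Python. This is a sketch under the assumption of Euclidean distance; the one-dimensional data are hypothetical and chosen so the arithmetic is easy to verify by hand.

```python
import math
from itertools import product

def euclidean(a, b):
    """Straight-line distance between two items."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(cluster_a, cluster_b):
    """Mean of all item-to-item distances between two clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

# Hypothetical clusters of three items each (A1..A3 and B1..B3).
cluster_a = [(0.0,), (1.0,), (2.0,)]
cluster_b = [(10.0,), (11.0,), (12.0,)]

# Pairwise distances: 10, 11, 12, 9, 10, 11, 8, 9, 10 (sum 90, nine pairs).
d_average = average_linkage(cluster_a, cluster_b)
```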

Ward's Method (PROC CLUSTER METHOD=WARD)
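Ward's method is usually described as merging, at each step, the pair of clusters whose union gives the smallest increase in the within-cluster error sum of squares (ESS). The slide's own bullets did not survive, so the following Python sketch illustrates that standard description only; the data and the names `ess` and `ward_cost` are hypothetical.

```python
from itertools import combinations

def centroid(cluster):
    """Coordinate-wise mean of the items in a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def ess(cluster):
    """Error sum of squares: squared distances of items to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(item, c)) for item in cluster)

def ward_cost(cluster_a, cluster_b):
    """Increase in total ESS caused by merging the two clusters."""
    return ess(cluster_a + cluster_b) - ess(cluster_a) - ess(cluster_b)

# Hypothetical clusters in p = 2 dimensions.
clusters = {
    "A": [(0.0, 0.0), (0.0, 2.0)],
    "B": [(1.0, 1.0)],
    "C": [(10.0, 10.0), (10.0, 12.0)],
}

# Pick the pair whose merger increases ESS the least.
best_pair = min(
    combinations(sorted(clusters), 2),
    key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
)
best_cost = ward_cost(clusters[best_pair[0]], clusters[best_pair[1]])
```

Here merging A and B costs 2/3, while any merger involving C costs more than 100, so A and B are joined first.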

SAS options for Data Normalization

PROC ACECLUS output from the Poverty data set (p=3): Q-Q plots to check MVN on the transformed variables (can1, can2, can3), which is needed for Ward's method. Rq(can1)=0.951, Rq(can2)=0.981, Rq(can3)=0.976, where n=97 and the critical point is RqCP=0.9895 at α=0.1. All three correlations fall below the critical point, so normality is rejected for each transformed variable at this level. A more thorough investigation would involve outlier detection and removal as well as testing data transformations (Box-Cox).
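The Q-Q correlation check can be sketched in Python. The Rq values and the critical point below are the ones quoted on the slide; everything else is an assumption: the function name `qq_correlation` is hypothetical, and the plotting positions (j - 1/2)/n are one common textbook convention that may differ from what produced the slide's numbers.

```python
import math
from statistics import NormalDist, mean

def normal_scores(n):
    """Standard normal quantiles at probability levels (j - 1/2)/n (assumed convention)."""
    return [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]

def qq_correlation(values):
    """Correlation between the ordered data and the matching normal quantiles."""
    x = sorted(values)
    q = normal_scores(len(x))
    x_bar, q_bar = mean(x), mean(q)
    sxq = sum((xi - x_bar) * (qi - q_bar) for xi, qi in zip(x, q))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sqq = sum((qi - q_bar) ** 2 for qi in q)
    return sxq / math.sqrt(sxx * sqq)

# Hypothetical samples: one perfectly normal-shaped, one strongly right-skewed.
r_linear = qq_correlation([3.0 + 2.0 * q for q in normal_scores(20)])
r_skewed = qq_correlation([math.exp(k / 4) for k in range(20)])

# Values reported on the slide (n = 97, alpha = 0.1).
critical_point = 0.9895
reported = {"can1": 0.951, "can2": 0.981, "can3": 0.976}
rejected = [name for name, r in reported.items() if r < critical_point]
```

A correlation close to 1 means the Q-Q plot is nearly a straight line; values below the critical point lead to rejecting normality for that variable.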

Minimal code needed for a cluster analysis
Generate a data set containing only the number of clusters we wish to examine, for use in PLOTTING if needed.
Sampling proportion: try values from 0.01 to 0.5.

PROC TREE output: how many clusters do we think are appropriate? (The distance criterion and its value at the time of each merger are on the horizontal axis.)
Panels: Ward's; Average

Pseudo-F Statistic Plot Interpretation
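The slide's own definition did not survive extraction. For reference, the pseudo-F statistic is commonly written as a ratio of between-cluster to within-cluster variation; the symbols below are assumptions introduced here, with T the total sum of squares, P_G the within-cluster sum of squares summed over the G clusters, and n the number of items.

```latex
\text{pseudo-}F \;=\; \frac{(T - P_G)/(G - 1)}{P_G/(n - G)}
```

Under this reading, relatively large values (peaks in the plot against G) point to candidate numbers of clusters.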

Pseudo-T2 Statistic Plot Interpretation
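As with the pseudo-F slide, the definition itself is missing here. A commonly given form of the pseudo-t² statistic for the merger of clusters K and L into cluster M is shown below; the symbols are assumptions introduced here, with W denoting a cluster's within-cluster sum of squares and N its number of items.

```latex
\text{pseudo-}t^2 \;=\; \frac{B_{KL}}{(W_K + W_L)/(N_K + N_L - 2)},
\qquad B_{KL} = W_M - W_K - W_L
```

A sharp rise in this value at a given merger suggests that the two clusters just joined were distinct, so the solution with one more cluster is usually the one to consider.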

Comparison of CCC, Pseudo-F, and Pseudo-T2 under different clustering runs, varying distance, linkage, and normalization
If we didn't have a low-dimensional variable set (p=3), it would be impossible to build a case on AVERAGE and SINGLE linkage.
Panels: Euclidean dist, AVG linkage, ACECLUS normalized (?); Ward linkage, ACECLUS normalized (what we want to see); Single linkage, ACECLUS normalized (?)

Birth Rate vs Death Rate
Notice the evidence for the known bias of Ward's method toward equal numbers of observations per cluster, whereas AVG linkage allows some small clusters, as in the lower right. The Expected Maximum Likelihood (EML) method in PROC CLUSTER produces results similar to Ward's method, but with a slight bias in the opposite direction, toward clusters of unequal sizes.
Panels: Ward linkage, ACECLUS norm; Euclidean dist, AVG linkage, ACECLUS norm

Birth Rate vs Infant-Death Rate
Panels: Ward linkage, ACECLUS norm; Euclidean dist, AVG linkage, ACECLUS norm

Death Rate vs Infant-Death Rate
Panels: Ward linkage, ACECLUS norm; Euclidean dist, AVG linkage, ACECLUS norm

Lessons learned?

Here's an example of the risk of "bad" hierarchical agglomerative clustering early on: a small run on 8 items shows divergence in cluster membership. If the final number of clusters were 4, these two runs would give different results. Which would be best? A robust approach produces only slight differences in clustering, but a bad approach can produce significant differences that are never undone as hierarchical agglomerative clustering proceeds.

MVN and outlier sensitivity of Ward's linkage: a test on a small 4-item sample shows the effect of clustering with ACECLUS normalization (left) and with no normalization (right) under Ward's linkage method. The clustering is somewhat different.

METHOD=WARD in PROC CLUSTER (pp. 692-693, Johnson & Wichern)

Ward's + ACECLUS

We need a stopping criterion: what is the best number of clusters to use? We don't want too few clusters and/or a RISE in SPRSQ.
Plot annotations: large jump in SPRSQ; small increase in SPRSQ; intermediate increase in SPRSQ

How to interpret the PROC CLUSTER RAW output: the cluster NAME and PARENT cluster columns can be interpreted as noted below.
Bulgaria + Czechoslovakia -> C3
FormerEGermany + C3 -> C2
Albania + C2 -> C1
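Reading the NAME and PARENT columns amounts to following each merger back to the original items. A minimal Python sketch, using only the three mergers shown on the slide; the tuple layout and the function name `expand` are hypothetical and do not mirror the SAS output data set.

```python
# Merge history as read off the slide: (first member, second member, new cluster).
merges = [
    ("Bulgaria", "Czechoslovakia", "C3"),
    ("FormerEGermany", "C3", "C2"),
    ("Albania", "C2", "C1"),
]

# Map each cluster name to the two things that were joined to form it.
children = {new: (a, b) for a, b, new in merges}

def expand(name, children):
    """Recursively list the original items contained in a cluster name."""
    if name not in children:
        return [name]  # a leaf: an original item, not a cluster
    left, right = children[name]
    return expand(left, children) + expand(right, children)

members_c3 = expand("C3", children)
members_c1 = expand("C1", children)
```

Expanding C1 recovers all four countries, which matches reading the slide's three lines from the bottom up.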

SPRSQ: SAS Cluster Output
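The bullets for this slide were lost. As a reminder of what the column measures, the semipartial R-squared (SPRSQ) of a merger is commonly defined as the loss of between-cluster variation caused by that merger, as a fraction of the total. The symbols below are assumptions introduced here: B_KL is the increase in within-cluster sum of squares when clusters K and L are joined, and T is the total sum of squares.

```latex
\mathrm{SPRSQ} \;=\; \frac{B_{KL}}{T}
```

A large SPRSQ at a merger therefore means that two fairly distinct clusters were joined.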

How to interpret the PROC TREE RAW output: focus on CLUSTER and CLUSTERNAME. The cluster 1 event forms CL3, the cluster 2 event adds FEG (FormerEGermany), and the cluster 3 event adds Albania.

Prior to clustering we’ll use PROC ACECLUS to generate normalized variables: Can1~BirthRate, Can2~DeathRate, Can3~InfantDeathRate

True distance measures between items are preferable in clustering, but they are not always possible (e.g. with binary variables)

Mahalanobis Distance
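The formula from this slide was lost in extraction. The standard definition is given here for reference, with symbols that are assumptions of this note: x and y are two items (p-vectors) and S is the sample covariance matrix of the variables.

```latex
d_M(\mathbf{x}, \mathbf{y}) \;=\; \sqrt{(\mathbf{x} - \mathbf{y})^{\top}\, \mathbf{S}^{-1}\, (\mathbf{x} - \mathbf{y})}
```

When S is the identity matrix this reduces to ordinary Euclidean distance; otherwise it rescales the variables by their variances and accounts for their correlations.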

Minkowski Distance
m=1: sum of absolute differences, or "City Block" distance
m=2: square root of the sum of squared differences, or Euclidean distance
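Both special cases come from one formula: sum the absolute coordinate differences raised to the power m, then take the m-th root. A minimal Python sketch with hypothetical data (the function name `minkowski` is illustrative):

```python
def minkowski(a, b, m):
    """Minkowski distance of order m between two items."""
    return sum(abs(x - y) ** m for x, y in zip(a, b)) ** (1.0 / m)

# Hypothetical items in p = 2 dimensions.
a, b = (0.0, 0.0), (3.0, 4.0)

d_city_block = minkowski(a, b, 1)  # |3| + |4|
d_euclidean = minkowski(a, b, 2)   # sqrt(3^2 + 4^2)
```

For these two items the city-block distance is 7 and the Euclidean distance is 5, so the choice of m changes which items count as close.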

SAS CODE for Clustering
