Data Science - Part VII - Cluster Analysis
This lecture provides an overview of clustering techniques, including K-Means, Hierarchical Clustering, and Gaussian Mixed Models. We will go through some methods of calibration and diagnostics and then apply the technique on a recognizable dataset.

Published in: Data & Analytics


  1. Presented by: Derek Kane
  2. • Introduction to Clustering Techniques
     • K-Means
     • Hierarchical Clustering
     • Gaussian Mixture Models
     • Visualization of the Distance Matrix
     • Practical Example
  3. • Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
     • It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
     • Cluster analysis itself is not one specific algorithm, but the general task to be solved.
  4. There are many real-world applications of clustering:
     • Grouping or Hierarchies of Products
     • Recommendation Engines
     • Biological Classification
     • Typologies
     • Crime Analysis
     • Medical Imaging
     • Market Research
     • Social Network Analysis
     • Markov Chain Monte Carlo Methods
  5. • According to Vladimir Estivill-Castro, the notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms.
     • There is no objectively "correct" clustering algorithm; as has been noted, "clustering is in the eye of the beholder."
     There are many different types of clustering models, including:
     • Connectivity models
     • Centroid models
     • Distribution models
     • Density models
     • Group models
     • Graph-based models
  6. • A "clustering" is essentially a set of such clusters, usually containing all objects in the data set.
     • Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other.
     Clusterings can be roughly distinguished as:
     • Hard clustering: each object either belongs to a cluster or does not.
     • Soft clustering (also fuzzy clustering): each object belongs to each cluster to a certain degree (e.g., a likelihood of belonging to the cluster).
  7. • K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The algorithm is composed of the following steps:
     1. Place K points into the space represented by the objects being clustered. These points represent the initial group centroids.
     2. Assign each object to the group that has the closest centroid.
     3. When all objects have been assigned, recalculate the positions of the K centroids.
     4. Repeat steps 2 and 3 until the centroids no longer move.
     • The k-means algorithm does not necessarily find the optimal configuration corresponding to the global minimum of the objective function.
     • The algorithm is also significantly sensitive to the initial randomly selected cluster centers.
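The four steps above can be sketched in plain Python. This is a minimal illustration, not code from the deck: the toy 2-D points, the random-sample initialisation, and the fixed seed are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=42):
    """Lloyd's k-means on a list of coordinate tuples.
    Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: place K initial centroids (here: a random sample of the points).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:  # empty cluster: keep the old centroid
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; k = 2 recovers them regardless of the seed.
pts = [(0.0, 0.0), (0.1, 0.2), (-0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
cents, labs = kmeans(pts, 2)
```

Because the initial centroids are drawn at random (step 1), different seeds can converge to different local minima, which is exactly the sensitivity to starting centers noted above.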
  8. • A scree plot is a graphical display of the variance of each (cluster) component in the dataset, used to determine how many components should be retained in order to explain a high percentage of the variation in the data.
     • The variance explained by each component is calculated as λ_i / Σ_{j=1}^{n} λ_j, where λ_i is the i-th eigenvalue and Σ_{j=1}^{n} λ_j is the sum of all of the eigenvalues.
     • The plot shows the variance for the first component; for the subsequent components, it shows the additional variance that each component adds.
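The proportion-of-variance formula above is simple enough to compute directly; the eigenvalues below are made up for illustration:

```python
def variance_explained(eigenvalues):
    """Proportion of total variance carried by each component:
    lambda_i divided by the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

# Hypothetical eigenvalues with a visible "elbow" after the second component.
lams = [4.0, 2.0, 0.5, 0.3, 0.2]
props = variance_explained(lams)  # first component: 4.0 / 7.0
```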
  9. • A scree plot is sometimes referred to as an "elbow" plot.
     • In order to identify the optimal number of clusters for further analysis, we look for the "bend" in the graph at the elbow. (Figure: # of clusters should be 3 or 4.)
  10. • Silhouette refers to a method of interpretation and validation of clusters of data.
      • The silhouette technique provides a succinct graphical representation of how well each object lies within its cluster and was first described by Peter J. Rousseeuw in 1987.
  11. The goal of a silhouette plot is to identify the point of the highest Average Silhouette Width (ASW). This is the optimal number of clusters for the analysis. (Figure: Cluster = 4.)
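For reference, Rousseeuw's silhouette width of a point i is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in i's own cluster and b(i) is the smallest mean distance to the members of any other cluster; the ASW is the mean of s(i) over all points. A plain-Python sketch (the toy points and labels are assumptions for illustration):

```python
import math

def silhouette_widths(points, labels):
    """Per-point silhouette width s(i) = (b - a) / max(a, b)."""
    n = len(points)
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:  # convention: s(i) = 0 for a singleton cluster
            widths.append(0.0)
            continue
        # a(i): mean distance to the rest of i's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in grp) / len(grp)
            for grp in ([j for j in range(n) if labels[j] == m]
                        for m in set(labels) - {labels[i]})
        )
        widths.append((b - a) / max(a, b))
    return widths

# Two tight, well-separated clusters -> ASW close to 1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labs = [0, 0, 0, 1, 1, 1]
w = silhouette_widths(pts, labs)
asw = sum(w) / len(w)
```

In practice one computes the ASW for each candidate k and keeps the k with the highest value, which is what the silhouette plot summarises.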
  12. • The cluster plot shows the spread of the data within each of the generated clusters.
      • For data with more than 2 dimensions, a Principal Component Analysis can be utilized to view the clustering.
  13. Here is an example of K-Means clustering based on the Ruspini data. (Figure: Cluster = 4.)
  14. • Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering.
      • Hierarchical clustering does not require us to prespecify the number of clusters, and most hierarchical algorithms that have been used in information retrieval (IR) are deterministic.
      • These advantages of hierarchical clustering come at the cost of lower efficiency.
  15. Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is this:
      1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
      2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one less cluster.
      3. Compute distances (similarities) between the new cluster and each of the old clusters.
      4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
      (Figure: Cluster = 4.)
  16. Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering:
      • In single-linkage clustering, the distance between two clusters is the shortest distance from any member of one cluster to any member of the other.
      • In complete-linkage clustering, the distance between two clusters is the greatest distance from any member of one cluster to any member of the other.
      • In average-linkage clustering, the distance between two clusters is the average distance from any member of one cluster to any member of the other.
      (Figure: single-linkage on density-based clusters.)
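Johnson's procedure with the three linkage choices can be sketched in plain Python. This is a naive, unoptimised illustration on made-up points, not a production implementation:

```python
import math
from itertools import combinations

def agglomerate(points, k, linkage="single"):
    """Johnson-style agglomerative clustering down to k clusters.
    Returns clusters as lists of point indices."""
    agg = {"single": min,                        # shortest member-to-member distance
           "complete": max,                      # greatest member-to-member distance
           "average": lambda d: sum(d) / len(d)  # mean member-to-member distance
           }[linkage]

    def cluster_dist(a, b):
        return agg([math.dist(points[i], points[j]) for i in a for j in b])

    # Step 1: assign each item to its own cluster.
    clusters = [[i] for i in range(len(points))]
    # Step 4: repeat steps 2 and 3 until only k clusters remain.
    while len(clusters) > k:
        # Steps 2-3: find and merge the closest pair under the chosen linkage.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.1, 0.2), (-0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
groups = agglomerate(pts, 2, linkage="average")
```

Running the loop all the way down to a single cluster, while recording each merge, yields the full hierarchy that a dendrogram displays.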
  17. Here is an example of hierarchical clustering based on the Ruspini data. (Figure: Cluster = 4.)
  18. • Another technique we can use for clustering involves the application of the EM algorithm for Gaussian Mixture Models.
      • The approach uses the Mclust package in R, which utilizes the BIC statistic as the measurement criterion.
      • The goal when identifying the number of clusters is to maximize the BIC.
      • This approach is considered to be a fuzzy clustering method.
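The deck fits these models with R's Mclust; as a language-neutral illustration of the underlying EM loop, here is a two-component, one-dimensional Gaussian mixture in plain Python. The synthetic data, starting values, and iteration count are all assumptions for the example; this is not the Mclust implementation and performs no BIC model selection.

```python
import math
import random

def em_gmm_1d(xs, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture.
    Returns (weights, means, standard deviations)."""
    # Crude initialisation: split the sorted sample at the median.
    srt = sorted(xs)
    half = len(srt) // 2
    mu = [sum(srt[:half]) / half, sum(srt[half:]) / (len(srt) - half)]
    sd = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        # (this soft assignment is what makes the method "fuzzy").
        resp = []
        for x in xs:
            p = [w[c] * pdf(x, mu[c], sd[c]) for c in range(2)]
            tot = sum(p)
            resp.append([pc / tot for pc in p])
        # M-step: re-estimate weights, means, and standard deviations.
        for c in range(2):
            nc = sum(r[c] for r in resp)
            w[c] = nc / len(xs)
            mu[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nc
            var = sum(r[c] * (x - mu[c]) ** 2 for r, x in zip(resp, xs)) / nc
            sd[c] = math.sqrt(max(var, 1e-6))
    return w, mu, sd

# Synthetic sample: two Gaussians centred at 0 and 5 (fixed seed).
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]
w, mu, sd = em_gmm_1d(xs)
```

Mclust repeats fits like this across different numbers of components and covariance structures, then picks the configuration with the maximum BIC, as the next slide shows.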
  19. Here are some diagnostics which are useful to establish the correct fit to the data. (Figure: the maximum BIC indicates that Cluster = 5 should be used.)
  20. The Mclust classification chart depicts the 5-centroid distribution and also the relative degree of classification uncertainty.
  21. The contour plot shows the density of the data points relative to each of the centroids.
  22. • A similarity matrix is a matrix of scores that represent the similarity between a number of data points.
      • Each element of the similarity matrix contains a measure of similarity between two of the data points.
      • A distance matrix is a matrix (two-dimensional array) containing the distances, taken pairwise, of a set of points.
      • This matrix has size N×N, where N is the number of points, nodes, or vertices (often in a graph).
      (Figure: heat map of a similarity matrix.)
  23. (Figure: example points on a graph, their distance matrix, and the corresponding heat map.)
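Constructing the distance matrix behind such a heat map is straightforward; a plain-Python sketch with Euclidean distances on made-up points:

```python
import math

def distance_matrix(points):
    """N x N matrix of pairwise Euclidean distances."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]

# 3-4-5 triangles make the distances easy to verify by eye.
pts = [(0, 0), (3, 4), (6, 8)]
D = distance_matrix(pts)  # D[0][1] = 5.0, D[0][2] = 10.0
```

The matrix is symmetric with a zero diagonal; reordering its rows and columns by cluster label is what produces the block structure shown on the following slides.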
  24. Here is an application of the distance matrix to the Ruspini dataset.
  25. Now we reorder the distance matrix by the k-means (k = 4) cluster labels.
  26. Now we bring in the dissplot to compare the heat map against expected values. We are looking for symmetry between the observed colors and the expected colors.
  27. • This is an example of an incorrect k-means specification of k = 3 instead of k = 4.
      • The heat map shows that within the 1st cluster there appear to be data points which should be separated into an additional distinct cluster.
      • The similarity and distance matrices can be powerful visual aids when evaluating the fit of clusters.
  28. • The goal of a clustering exercise should be to identify the appropriate pockets or groups of data that should be grouped together.
      • Because the structure of the underlying data is usually unknown, different clustering techniques may produce many different clusterings.
      • Therefore, let's use a dataset where the correct number of clusters is obvious.
  29. This dataset clearly shows that there should be 4 clusters: 2 eyes, 1 nose, and 1 mouth.
  30. Let's first run a k-means clustering algorithm: K = 6 clusters. The k-means produced too many clusters based on the dataset.
  31. Now let's use the hierarchical clustering method: K = 4 clusters.
  32. • Reside in Wayne, Illinois.
      • Active semi-professional classical musician (bassoon).
      • Married my wife on 10/10/10; we have been together for 10 years.
      • Pet Yorkshire Terrier / Toy Poodle named Brunzie.
      • Pet Maine Coons named Maximus Power and Nemesis Gul du Cat.
      • Enjoy cooking, hiking, cycling, kayaking, and astronomy.
      • Self-proclaimed data nerd and technology lover.
  33. • http://www.stat.washington.edu/mclust/
      • http://www.stats.gla.ac.uk/glossary/?q=node/451
      • http://www.norusis.com/pdf/SPC_v13.pdf
      • http://en.wikipedia.org/wiki/Cluster_analysis
      • http://michael.hahsler.net/SMU/EMIS7332/R/chap8.html
      • http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
      • http://www.autonlab.org/tutorials/gmm14.pdf
      • http://en.wikipedia.org/wiki/Distance_matrix
