A brief introduction to clustering with scikit-learn. In this presentation, we provide an overview, with real examples, of how to use and optimize k-means clustering.
2. About Me
• Chief Data Scientist, WPC Healthcare
• Speaker
• Researcher
• Writer
3. Outline
• What is k-means clustering?
• How does it work?
• When is it appropriate to use it?
• K-means clustering in scikit-learn
• Basic
• Basic with adjustments
4. Clustering
• It is unsupervised learning (inferring a function to describe not-so-obvious structure in unlabeled data)
• Groups data objects
• Measures distance between data points
• Helps in examining the data
5. K-means Clustering
• Formally: a method of vector quantization
• Informally: a mapping from a large set of inputs to a smaller, countable set
• Separates data into groups of equal variance
• Uses the Euclidean distance metric
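In scikit-learn this amounts to a couple of lines. A minimal sketch, using invented toy data:

```python
# Minimal k-means with scikit-learn; the toy data below is made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs in 2-D
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.8], [7.9, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each row of X
print(kmeans.cluster_centers_)  # centroid of each cluster
```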
6. K-means Clustering
Repeated refinement in three basic steps:
• Step 1: Choose k (how many groups)
• Repeat:
• Step 2: Assignment (label each data point with its nearest group)
• Step 3: Update (recompute each group's center)
This process continues until the assignments stop changing.
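The choose/assign/update loop can be sketched in plain NumPy. This is an illustrative toy, not scikit-learn's actual implementation:

```python
# Illustrative sketch of the k-means loop: assign points, update centers, repeat.
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k -- here, k random data points become the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assignment -- label each point with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update -- move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # goal reached: centers stopped moving
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans_sketch(
    np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]), k=2)
```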
8. K-means Clustering
• Advantages
• Scales to large datasets
• Fast
• Always terminates with a solution
• Disadvantages
• Sensitive to the choice of the number of groups
• May converge to a local optimum rather than the global one
9. K-means Clustering
• When to use
• Normally distributed data
• Large number of samples
• Not too many clusters
• Distances between points are meaningful in a linear (Euclidean) sense
16. K-means Parameters
• n_clusters
• Number of clusters to form
• max_iter
• Maximum number of iterations of the algorithm in a single run
• n_init
• Number of times the algorithm runs with different initialization points; the best result is kept
• init
• The method used to initialize the centroids
• precompute_distances
• True, False, or 'auto' (let the library decide)
• tol
• Convergence tolerance: how small the improvement must be before the algorithm declares convergence
• n_jobs
• Number of CPUs to engage when running the algorithm
• random_state
• Seed for the random number generation used in centroid initialization
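Putting the parameters together might look like this. Names follow the scikit-learn `KMeans` API; `precompute_distances` and `n_jobs` are omitted because recent scikit-learn releases have removed them, and the data is invented:

```python
# The core KMeans parameters from the list above, on invented toy data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

kmeans = KMeans(
    n_clusters=3,      # number of clusters to form
    init="k-means++",  # initialization method
    n_init=10,         # runs with different starting centroids; best kept
    max_iter=300,      # cap on iterations within a single run
    tol=1e-4,          # convergence tolerance
    random_state=42,   # seed for reproducible initialization
).fit(X)

print(kmeans.n_iter_)  # iterations the best run actually needed
```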
17. n_clusters: choosing k
• View the variance
• cdist computes distances between two sets of observations
• pdist computes pairwise distances between observations within a single set
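For reference, both functions live in `scipy.spatial.distance`:

```python
# cdist: distances between two sets; pdist: pairwise distances within one set.
import numpy as np
from scipy.spatial.distance import cdist, pdist

A = np.array([[0.0, 0.0], [3.0, 4.0]])
B = np.array([[0.0, 0.0], [6.0, 8.0]])

print(cdist(A, B))  # shape (2, 2): every row of A vs. every row of B
print(pdist(A))     # condensed form: the single distance within A -> [5.]
```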
18. n_clusters: choosing k
Step 1: Determine your k range
Step 2: Fit the k-means model for each n_clusters = k
Step 3: Pull out the cluster centers for each model
19. n_clusters: choosing k
Step 4: Calculate Euclidean distance from each point to each cluster center
Step 5: Total within-cluster sum of squares
Step 6: Total sum of squares
Step 7: Between-cluster sum of squares (total minus within-cluster)
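Steps 1 through 7 can be sketched as follows, with synthetic blobs standing in for real data:

```python
# Elbow-method sketch: within- and between-cluster sums of squares for each k.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 10)                                           # Step 1: k range
models = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X) for k in ks]  # Step 2
centers = [m.cluster_centers_ for m in models]              # Step 3

# Step 4: Euclidean distance from every point to every center; keep the nearest
nearest = [np.min(cdist(X, c), axis=1) for c in centers]
wcss = np.array([np.sum(d ** 2) for d in nearest])          # Step 5: within-cluster SS
tss = np.sum(cdist(X, X.mean(axis=0, keepdims=True)) ** 2)  # Step 6: total SS
bss = tss - wcss                                            # Step 7: between-cluster SS
```

Plotting `wcss` against `ks` and looking for the "elbow" where the curve flattens gives a reasonable k.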
23. init
Methods and their meaning:
• k-means++
• Selects initial cluster centers in a way that speeds up convergence
• random
• Chooses k rows of the data at random as initial centroids
• ndarray
• Pass an array of shape (n_clusters, n_features) giving the initial centers
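The three options side by side, on invented data. The explicit-centers variant uses `n_init=1`, since re-running with the same fixed centers would be redundant:

```python
# The three init options for scikit-learn's KMeans, on invented data.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))

km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

seeds = X[:3]  # explicit initial centers, shape (n_clusters, n_features)
km_arr = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
```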
26. Comparing Results: Silhouette Score
• Silhouette coefficient
• Not black and white, lots of gray
• Average distance between each observation and the other observations in its own cluster
• Average distance between each observation and all points in the NEXT nearest cluster
• Silhouette score in scikit-learn
• Average silhouette coefficient over all observations
• The closer to 1, the better the fit
• Computation time increases with larger datasets
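Comparing several values of k by silhouette score might look like this, using synthetic blobs and `sklearn.metrics.silhouette_score`:

```python
# Fit k-means for several k and report the average silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # closer to 1 = tighter, better-separated
    print(k, round(scores[k], 3))
```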
28. What Do the Results Say?
• Data patterns may in fact exist
• Similar observations can be grouped
• We need additional discovery
29. A Few Hacks
• Clustering is a great way to explore your data and develop intuition
• Too many features make the results hard to interpret
• Use dimensionality reduction
• Combine clustering with other methods
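As a sketch of that pair of hacks: reduce a 64-feature dataset to two principal components, then cluster the projection. The digits dataset ships with scikit-learn:

```python
# Dimensionality reduction (PCA) before clustering, to make results easier to inspect.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1797 images, 64 pixel features each
X2 = PCA(n_components=2, random_state=0).fit_transform(X)  # project to 2-D
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X2)
```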
30. Let’s Connect
• Twitter: @DamianMingle
• LinkedIn: DamianRMingle
• Sign-up for Data Science Hacks