Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
K-Means Clustering
Parameter Tuning & Use Cases
Terminologies
Introduction & Example
Standard input/tuning parameters & Sample UI
Sample output UI
Interpretation of Output
Limitations
Business use cases
What Is Covered
Introduction With Example
What is it used for?
It’s a process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another and as similar as possible within each group.
Thus, it’s simply a grouping of similar things/data points.
For example, objects within group 1 (cluster 1) shown in the image above should be as similar as possible, but there should be a clear difference between an object in group 1 and an object in group 2.
The attributes of the objects decide which objects should be grouped together.
Thus, a natural grouping of data points can be achieved.
Some Examples
Let's take a few examples for more clarity:
Loan applicants in a bank can be grouped into low, medium, and high risk applicants based on their age, annual income, employment tenure, loan amount, times delinquent, etc. using the K-means clustering algorithm.
Users of a movie ticket booking website can be grouped into movie buffs, moderate watchers, and rare watchers based on their past ticket purchase behavior, such as days since the last movie seen, average number of tickets booked each time, frequency of ticket bookings per month, etc.
Retail customers can be grouped into loyal, infrequent, and rare customer groups based on their retail outlet/website visits per month, purchase amount per month, purchase frequency per month, etc.
It is used to find groups which have not been
explicitly labeled in the data. This can be
used to confirm business assumptions about
what types of groups exist or to identify
unknown groups in complex data sets.
Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group, as sketched below.
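Below is a minimal sketch (an assumption of this write-up, using scikit-learn, which the deck does not prescribe) of how a fitted K-means model assigns new records to existing groups; the feature values are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Historical records, e.g. [annual income, loan amount] for past applicants (illustrative values)
X = np.array([[45000, 10000],
              [92000, 15000],
              [110000, 25600],
              [38000, 10000]])

# Fit K-means with 2 clusters on the historical data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A new applicant is assigned to the nearest existing cluster center
new_applicant = np.array([[60000, 12000]])
print(kmeans.predict(new_applicant))   # e.g. array([0])
```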
How it works
Step 1: Begin with a decision on the value of k, the number of clusters (groups), and the input variables. The silhouette score can be used to determine k.
Step 2: Scale the data using (x - min(x)) / (max(x) - min(x)) and initialize the cluster centers: randomly select k observations from the scaled data and treat them as the initial cluster centers.
Step 3: Calculate the Euclidean distance between an observation and the initial cluster centers.
• Based on Euclidean distance, each observation is assigned to one of the clusters, using minimum distance.
Step 4: Move on to the next observation, calculate the Euclidean distance, update the cluster centers, and assign this observation a cluster membership based on minimum distance, as in Step 3.
Step 5: Repeat Step 4 until all observations are assigned a cluster membership.
Step 6: Check the cluster plot and silhouette score to measure the goodness of the clusters generated.
How it works – Steps
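The steps above can be sketched roughly in Python (plain NumPy, an illustration only; this batch-style update is a simplification of the one-observation-at-a-time walkthrough in the following slides):

```python
import numpy as np

def min_max_scale(X):
    # Step 2 scaling: (x - min(x)) / (max(x) - min(x)), applied per column
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def k_means(X, k, max_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random observations as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3-5: Euclidean distance to every center, assign each point to the nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center to the mean of its members (keep the old center if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Step 6 (cluster plot and silhouette score) would then be applied to the returned labels.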
Data sample:
Height  Weight
185  72
170  56
168  60
179  68
182  72
188  77
180  71
180  70
183  84
180  88
180  67
177  76

Initial cluster centers:
Cluster  Height  Weight
K1  185  72
K2  170  56
Step 1: Input
• Scaled variables and the number of clusters (k)
• In this example, only two variables, Height and Weight, are considered for clustering
• Let's consider the number of clusters = 2
Step 2: Initialize cluster centers
• Let's initialize the cluster centers with the first two observations
Step 3: Calculate Euclidean distance
• The Euclidean distance between an observation and initial cluster centers 1 and 2 is calculated
• Based on Euclidean distance, each observation is assigned to one of the clusters, using minimum distance
Example
First two observations:
Height  Weight
185  72
170  56

Updated centers:
Cluster  Height  Weight
K1  185  72
K2  170  56

Step 3 (continued): The Euclidean distance from each of the clusters is calculated:
Observation (185, 72): distance from Cluster 1 = SQRT[(185-185)² + (72-72)²] = 0; distance from Cluster 2 = SQRT[(185-170)² + (72-56)²] = 21.93; assigned to Cluster 1
Observation (170, 56): distance from Cluster 1 = SQRT[(170-185)² + (56-72)²] = 21.93; distance from Cluster 2 = SQRT[(170-170)² + (56-56)²] = 0; assigned to Cluster 2
There is no change in the centers, since we used the same two observations as the initial centers.
Example
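The distance table above can be verified with a few lines of NumPy (an assumption; any tool that computes Euclidean distance would do):

```python
import numpy as np

centers = np.array([[185, 72], [170, 56]])        # initial centers K1, K2
for obs in np.array([[185, 72], [170, 56]]):      # first two observations
    d = np.linalg.norm(obs - centers, axis=1)     # Euclidean distance to each center
    print(obs, d.round(2), "-> cluster", d.argmin() + 1)
# [185 72] [ 0.   21.93] -> cluster 1
# [170 56] [21.93  0.  ] -> cluster 2
```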
Next observation:
Height  Weight
168  60

Step 4: Move on to the next observation, calculate the Euclidean distance, assign cluster membership, and update the cluster centers.
Euclidean distance from Cluster 1 = SQRT[(168-185)² + (60-72)²] = 20.808; Euclidean distance from Cluster 2 = SQRT[(168-170)² + (60-56)²] = 4.472; assigned to Cluster 2
• Since the distance is minimum from cluster 2, the observation is assigned to cluster 2.
• Now revise the cluster centers: the mean value of the member observations' Height and Weight.
• The addition is only to cluster 2, so the centroid of cluster 2 is updated as follows:

Updated cluster centers:
Cluster  Height  Weight
K1  185  72
K2  (170+168)/2 = 169  (56+60)/2 = 58
Example
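A quick check of this step (plain NumPy, illustration only): the new observation is nearest to cluster 2, and cluster 2's centroid becomes the mean of its members.

```python
import numpy as np

centers = np.array([[185, 72], [170, 56]])               # current centers K1, K2
obs = np.array([168, 60])                                # next observation
print(np.linalg.norm(obs - centers, axis=1).round(3))    # [20.809  4.472] -> nearest is cluster 2

cluster2_members = np.array([[170, 56], [168, 60]])      # members of cluster 2 so far
print(cluster2_members.mean(axis=0))                     # [169.  58.] updated centroid
```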
Next observation:
Height  Weight
179  68

Step 5: Repeat Step 4: calculate the Euclidean distance for the next observation, assign it based on minimum distance, and update the cluster centers, until all observations are assigned a cluster membership.
Euclidean distance from Cluster 1 = SQRT[(179-185)² + (68-72)²] = 7.21; Euclidean distance from Cluster 2 = SQRT[(179-169)² + (68-58)²] = 14.14; assigned to Cluster 1
• Since the distance is minimum from cluster 1, the observation is assigned to cluster 1.
• Now revise the cluster centroid: the mean value of the member observations' Height and Weight.
• The addition is only to cluster 1, so the centroid of cluster 1 is updated as follows:

Updated cluster centers:
Cluster  Height  Weight
K1  (185+179)/2 = 182  (72+68)/2 = 70
K2  169  58
Example
Step 6: Draw the cluster plot to see how the clusters are distributed. The less the overlap between clusters, the better the distribution and cluster assignments.

Final cluster centers:
Cluster  Height  Weight
K=1  182.8  72
K=2  169  58

Cluster plot (final assignments): silhouette = 0.8, indicating very good quality clusters.
The closer the silhouette score is to 1, the better the quality of the clusters.
Example
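For reference, here is a hedged end-to-end sketch of this worked example using scikit-learn (an assumption; scikit-learn's KMeans uses batch Lloyd-style updates rather than the one-observation-at-a-time updates above, so results may differ slightly from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# The Height/Weight data sample from the slides
X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
              [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

X_scaled = MinMaxScaler().fit_transform(X)                     # Step 2: min-max scaling
km = KMeans(n_clusters=2, max_iter=20, n_init=10, random_state=0).fit(X_scaled)

print(km.labels_)                                              # cluster membership per row
print(km.cluster_centers_)                                     # final centers (in scaled units)
print(round(silhouette_score(X_scaled, km.labels_), 2))        # partition quality, closer to 1 is better
```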
Standard Tuning Parameters
• Number of clusters (K):
  • The desired number of clusters
  • Suggested range: 3 to 5
  • The actual number could be smaller in the output if there are no divisible clusters in the data
  • This parameter input can be automated using the silhouette score (explained in later slides)
• Max iterations:
  • The maximum number of k-means iterations used to refine the clusters
  • By default this value should be set to 20
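For illustration, these two tuning parameters map directly onto the constructor arguments of a typical K-means implementation (scikit-learn shown here as an assumption; the deck's UI exposes them as "Number of clusters" and "Maximum iterations"):

```python
from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=3,     # Number of clusters (K), e.g. from the suggested range of 3 to 5
    max_iter=20,      # Maximum iterations, the default suggested above
    n_init=10,        # rerun with several random initializations (see Limitations)
    random_state=42,  # fix the seed so repeated runs give the same clusters
)
# model.fit(X_scaled) would then be called on the scaled input variables
```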
Sample UI For Input Variables & Parameters Selection & Output
Sample UI for selecting predictors and applying tuning parameters: For Two Predictors
Select the variables you would like to use as predictors to build clusters: Height, Weight, BMI
Tuning parameters: Number of clusters, Maximum iterations
• The silhouette score is another useful criterion for assessing the natural and optimal number of clusters, as well as for checking the overall quality of the partition
• The largest silhouette score, over different values of K, indicates the best number of clusters (see the sketch below)
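A minimal sketch (assuming scikit-learn, which the deck does not prescribe) of picking K by the largest silhouette score over a range of candidate values:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data, assumed already scaled

scores = {}
for k in range(2, 7):                                 # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)                  # K with the largest silhouette score
print(scores, "best K:", best_k)
```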
Height (cm)  Weight (kg)  BMI  Cluster number
158  60  23  1
160  65  25  2
170  70  26  2
149  50  21  1
180  80  27  3
165  80  28  3
200  90  23  1
Each customer is assigned a cluster membership as shown in the table on the left.
Cluster plot (Height vs. Weight): silhouette = 0.7, indicating good quality clusters.
Output UI: For Two Predictors
As the clusters are built using only 2 predictors here, the scatter plot axes reflect the actual predictors, i.e. Height and Weight, instead of principal components.
Again, the less the overlap in the cluster outlines, the better the cluster assignment.
Alternatively, the silhouette score can be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality.
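One possible way to draw the two-predictor cluster plot described above (matplotlib, an assumption; the deck's tool renders this plot itself):

```python
import matplotlib.pyplot as plt

def plot_two_predictor_clusters(height, weight, labels):
    # Color each point by its assigned cluster membership
    plt.scatter(height, weight, c=labels, cmap="viridis")
    plt.xlabel("Height (cm)")
    plt.ylabel("Weight (kg)")
    plt.title("K-means clusters on two predictors")
    plt.show()
```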
Sample UI for Selecting Predictors and Applying Tuning Parameters: For Four Predictors
Select the variables you would like to use as predictors to build clusters: Purchase amount, Purchase frequency, Total purchase quantity, Annual income, Website visits
Tuning parameters: Number of clusters, Maximum iterations
• The silhouette score is another useful criterion for assessing the natural and optimal number of clusters, as well as for checking the overall quality of the partition
• The largest silhouette score, over different values of K, indicates the best number of clusters
Customer ID  Purchase amount (per month)  Purchase frequency (per month)  Total purchase quantity  Annual income slab  Website visits (per month)  Cluster number
1   5k to 10k   2  6   2 lac to 4 lac    3 to 6   2
2   1k to 5k    1  4   1 lac to 2 lac    <3       1
3   10k to 15k  3  8   4 lac to 6 lac    6 to 10  3
4   5k to 10k   2  6   2 lac to 4 lac    3 to 6   2
5   10k to 15k  3  8   4 lac to 6 lac    6 to 10  3
6   1k to 5k    1  2   1 lac to 2 lac    <3       1
7   5k to 10k   2  6   2 lac to 4 lac    3 to 6   2
8   1k to 5k    1  2   1 lac to 2 lac    <3       1
9   1k to 5k    1  2   1 lac to 2 lac    <3       1
10  >20k        4  32  10 lac to 15 lac  >10      3
Each customer is assigned a cluster membership as shown in the table on the left.
Cluster plot (first principal component vs. second principal component)
Output UI: For Four Predictors
In the 2D cluster plot shown on the right, the cluster distribution is plotted.
In this case, the axes reflect the first two principal components (see the definition below) instead of the actual predictors, because the number of predictors is greater than 2. In the case of a 3D plot, the first three principal components would be shown. The less the overlap between clusters, the better the cluster assignment.
The silhouette score can also be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality.
Whenever there are more than 2 predictors as input, the axes reflect principal components instead of the actual predictors.
Principal components are linear combinations of the original predictors which capture the maximum variance in the data set.
Typically, most of the variance in the data is explained by the first two or three principal components, so the remaining components can be ignored for plotting.
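A hedged sketch (scikit-learn and matplotlib assumed, which the deck does not prescribe) of projecting clusters onto the first two principal components when there are more than two predictors:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(150, 4))      # placeholder data: 4 predictors

X_std = StandardScaler().fit_transform(X)               # standardize before clustering
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

pcs = PCA(n_components=2).fit_transform(X_std)          # first two principal components
plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, cmap="viridis")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()
```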
Other Sample Output Formats
silhouette = -0.5, indicating poor quality clusters
silhouette = 0.3, indicating average quality clusters
silhouette = 0.7, indicating good quality clusters
• Clusters with a silhouette score closer to 1 are more desirable, so this index can be used to measure the goodness of clusters.
Other Sample Output Formats
Limitations
• The number of clusters, k, must be determined beforehand. Ideally, the algorithm should auto-suggest this number for better user friendliness.
• It does not yield the same result with each run, since the resulting clusters depend on the initial random assignment of the group centers.
• If the data is input in a different order, it may produce different clusters when the number of data points is small; hence the number of data points must be large enough. It has been suggested that 2^m (where m = number of clustering variables) can be used as a rule of thumb for the sample data size.
• K-means is suitable only for numeric data.
• The scale of the data points influences the Euclidean distance, so variable standardization becomes necessary (see the sketch after this list).
• Empty clusters can be obtained if no points are allocated to a cluster during the assignment step.
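Because Euclidean distance is scale-sensitive, variables are usually rescaled before clustering. A minimal sketch (scikit-learn assumed; the column values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# e.g. [annual income, loan amount]: very different scales would otherwise dominate the distance
X = np.array([[105000.0, 21000.0],
              [92000.0, 15000.0],
              [45000.0, 10000.0]])

X_minmax = MinMaxScaler().fit_transform(X)    # (x - min) / (max - min), as in Step 2
X_zscore = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1 per column
```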
Applications & Business Use Cases
General applications of K-means Clustering
• Some examples of use cases are:
  • Behavioral segmentation:
    • Segment customers by purchase history
    • Segment users by activities on an application, website, or platform
    • Define personas based on interests
    • Create consumer profiles based on activity monitoring
• Some other general applications:
  • Pattern recognition, market segmentation, classification analysis, artificial intelligence, image processing, astronomy, agriculture, and many others.
Use case 1
• Business problem:
  • Grouping loan applicants into high/medium/low risk applicants based on attributes such as loan amount, monthly installment, employment tenure, times delinquent, annual income, debt-to-income ratio, etc.
• Business benefit:
  • Once segments are identified, the bank will have a loan applicants' dataset with each applicant labeled as high/medium/low risk.
  • Based on these labels, the bank can easily decide whether or not to give a loan to an applicant, and if yes, what credit limit and interest rate each applicant is eligible for, based on the amount of risk involved.
Use case 1
Input dataset:
Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure
1039153  21000  701.73  105000  9   5  4
1069697  15000  483.38  92000   11  5  2
1068120  25600  824.96  110000  10  9  2
563175   23000  534.94  80000   9   2  12
562842   19750  483.65  57228   11  3  21
562681   25000  571.78  113000  10  0  9
562404   21250  471.2   31008   12  1  12
700159   14400  448.99  82000   20  6  6
696484   10000  241.33  45000   18  8  2
702598   11700  381.61  45192   20  7  3
702470   10000  243.29  38000   17  9  7
702373   4800   144.77  54000   19  8  2
701975   12500  455.81  43560   15  8  4
Use case 1
Output: Each record will have the cluster (segment) assignment as shown below:
Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Cluster
1039153  21000  701.73  105000  9   5  4   Medium
1069697  15000  483.38  92000   11  5  2   Medium
1068120  25600  824.96  110000  10  9  2   Medium
563175   23000  534.94  80000   9   2  12  Low
562842   19750  483.65  57228   11  3  21  Low
562681   25000  571.78  113000  10  0  9   Low
562404   21250  471.2   31008   12  1  12  Low
700159   14400  448.99  82000   20  6  6   High
696484   10000  241.33  45000   18  8  2   High
702598   11700  381.61  45192   20  7  3   High
702470   10000  243.29  38000   17  9  7   High
702373   4800   144.77  54000   19  8  2   High
701975   12500  455.81  43560   15  8  4   High
Use case 1
Output: Cluster profiles:
Cluster  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Risk segment
1  10447.30  304.87  66467.74  9.58   1.69  16.82  Low
2  21391.58  598.54  94912.59  12.37  5.98  4.58   Medium
3  7521.32   227.43  60935.28  16.55  6.91  4.01   High

As can be seen in the table above, the high, medium, and low risk segments have distinctive characteristics.
The high risk segment has the highest likelihood of being delinquent, the highest debt-to-income ratio, and the lowest employment tenure compared to the other two segments.
The low risk segment exhibits exactly the opposite pattern, i.e. the lowest debt-to-income ratio, the lowest delinquency, and the highest employment tenure compared to the other two segments.
Hence, delinquency, employment tenure, and debt-to-income ratio are the determining factors when it comes to segmenting loan applicants.
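A sketch (pandas assumed; the column names mirror the table above but the helper itself is an assumption of this write-up) of how such a cluster-profile table can be produced: average each attribute within each assigned cluster.

```python
import pandas as pd

def cluster_profiles(df: pd.DataFrame) -> pd.DataFrame:
    # df holds the loan applicant records plus the assigned "Cluster" column
    numeric_cols = ["Loan amount", "Monthly installment", "Annual income",
                    "Debt to income ratio", "Times delinquent", "Employment tenure"]
    # Mean of each attribute per cluster, rounded as in the profile table
    return df.groupby("Cluster")[numeric_cols].mean().round(2)
```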
Use case 1
Output: Cluster distribution:
In the cluster distribution plot, there is negligible overlap in the cluster outlines, so we can say that the cluster assignment is good in our case.
Clusters with an average silhouette width closer to 1 are more desirable, so this index can be used to test the quality of the cluster distribution.
silhouette = 0.6, indicating good quality clusters
Use case 2
Business problem:
• Organizing customers into groups/segments based on similar traits, product preferences, and expectations
• Segments are constructed on the basis of customers' demographic characteristics, psychographics, past behavior, and product use behaviors
Business benefit:
• Once segments are identified, marketing messages and even products can be customized for each segment.
• The better the segment(s) chosen for targeting by a particular organization, the more successful it is assumed to be in the marketplace.
Use case 3
Business problem:
• Discount analysis and customer retention: visualize 'segments of the sales group based on discount behavior' and 'customer churn - segments of customers on the verge of leaving'
Business benefit:
• The business marketing team can focus on risky customer segments in an efficient way in order to keep them from churning/leaving
• Sales team segments which are facing challenges with the current discounting strategy can be identified, and the deal negotiation strategy can be improved/optimized for them.
Want to Learn More?
Get in touch with us @ support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018