Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
K-Means Clustering
Parameter Tuning & Use Cases
Terminologies
Introduction & Example
Standard input/tuning parameters & Sample UI
Sample output UI
Interpretation of Output
Limitations
Business use cases
What Is Covered
Introduction With Example
What is it used for?
It’s a process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another and as similar as possible within each group.
Thus, it’s simply a grouping of similar things/data points.
For example, objects within group 1 (cluster 1) shown in the image above should be as similar as possible, but there should be a clear difference between an object in group 1 and an object in group 2.
The attributes of the objects decide which objects should be grouped together.
Thus, a natural grouping of data points can be achieved.
Some Examples
Let's take a few examples for more clarity:
Loan applicants in a bank can be grouped into low, medium, and high risk applicants based on their age, annual income, employment tenure, loan amount, times delinquent, etc. using the K-means clustering algorithm.
Users of a movie ticket booking website can be grouped into movie buffs, moderate watchers, and rare watchers based on their past ticket purchase behavior, such as days since the last movie seen, average number of tickets booked each time, frequency of ticket bookings per month, etc.
Retail customers can be grouped into loyal, infrequent, and rare customer groups based on their retail outlet/website visits per month, purchase amount per month, purchase frequency per month, etc.
It is used to find groups which have not been
explicitly labeled in the data. This can be
used to confirm business assumptions about
what types of groups exist or to identify
unknown groups in complex data sets.
Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group, as sketched below.
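Below is a minimal sketch (an assumption of this write-up, using scikit-learn, which the deck does not prescribe) of how a fitted K-means model assigns new records to existing groups; the feature values are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Historical records, e.g. [annual income, loan amount] for past applicants (illustrative values)
X = np.array([[45000, 10000],
              [92000, 15000],
              [110000, 25600],
              [38000, 10000]])

# Fit K-means with 2 clusters on the historical data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A new applicant is assigned to the nearest existing cluster center
new_applicant = np.array([[60000, 12000]])
print(kmeans.predict(new_applicant))   # e.g. array([0])
```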
How it works
Step 1: Begin with a decision on the value of k, the number of clusters (groups), and the input variables. The silhouette score can be used to determine k.
Step 2: Scale the data using (x - min(x)) / (max(x) - min(x)) and initialize the cluster centers: randomly select k observations from the scaled data and treat them as the initial cluster centers.
Step 3: Calculate the Euclidean distance between an observation and the initial cluster centers.
• Based on Euclidean distance, each observation is assigned to one of the clusters, using minimum distance.
Step 4: Move on to the next observation, calculate the Euclidean distance, update the cluster centers, and assign this observation a cluster membership based on minimum distance, as in Step 3.
Step 5: Repeat Step 4 until all observations are assigned a cluster membership.
Step 6: Check the cluster plot and silhouette score to measure the goodness of the clusters generated.
How it works – Steps
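The steps above can be sketched roughly in Python (plain NumPy, an illustration only; this batch-style update is a simplification of the one-observation-at-a-time walkthrough in the following slides):

```python
import numpy as np

def min_max_scale(X):
    # Step 2 scaling: (x - min(x)) / (max(x) - min(x)), applied per column
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def k_means(X, k, max_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random observations as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3-5: Euclidean distance to every center, assign each point to the nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center to the mean of its members (keep the old center if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Step 6 (cluster plot and silhouette score) would then be applied to the returned labels.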
Data sample:
Height  Weight
185  72
170  56
168  60
179  68
182  72
188  77
180  71
180  70
183  84
180  88
180  67
177  76

Initial cluster centers:
Cluster  Height  Weight
K1  185  72
K2  170  56
Step 1: Input
• Scaled variables and the number of clusters (k)
• In this example, only two variables, Height and Weight, are considered for clustering
• Let's consider the number of clusters = 2
Step 2: Initialize cluster centers
• Let's initialize the cluster centers with the first two observations
Step 3: Calculate Euclidean distance
• The Euclidean distance between an observation and initial cluster centers 1 and 2 is calculated
• Based on Euclidean distance, each observation is assigned to one of the clusters, using minimum distance
Example
First two observations:
Height  Weight
185  72
170  56

Updated centers:
Cluster  Height  Weight
K1  185  72
K2  170  56

Step 3 (continued): The Euclidean distance from each of the clusters is calculated:
Observation (185, 72): distance from Cluster 1 = SQRT[(185-185)² + (72-72)²] = 0; distance from Cluster 2 = SQRT[(185-170)² + (72-56)²] = 21.93; assigned to Cluster 1
Observation (170, 56): distance from Cluster 1 = SQRT[(170-185)² + (56-72)²] = 21.93; distance from Cluster 2 = SQRT[(170-170)² + (56-56)²] = 0; assigned to Cluster 2
There is no change in the centers, since we used the same two observations as the initial centers.
Example
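The distance table above can be verified with a few lines of NumPy (an assumption; any tool that computes Euclidean distance would do):

```python
import numpy as np

centers = np.array([[185, 72], [170, 56]])        # initial centers K1, K2
for obs in np.array([[185, 72], [170, 56]]):      # first two observations
    d = np.linalg.norm(obs - centers, axis=1)     # Euclidean distance to each center
    print(obs, d.round(2), "-> cluster", d.argmin() + 1)
# [185 72] [ 0.   21.93] -> cluster 1
# [170 56] [21.93  0.  ] -> cluster 2
```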
Next observation:
Height  Weight
168  60

Step 4: Move on to the next observation, calculate the Euclidean distance, assign cluster membership, and update the cluster centers.
Euclidean distance from Cluster 1 = SQRT[(168-185)² + (60-72)²] = 20.808; Euclidean distance from Cluster 2 = SQRT[(168-170)² + (60-56)²] = 4.472; assigned to Cluster 2
• Since the distance is minimum from cluster 2, the observation is assigned to cluster 2.
• Now revise the cluster centers: the mean value of the member observations' Height and Weight.
• The addition is only to cluster 2, so the centroid of cluster 2 is updated as follows:

Updated cluster centers:
Cluster  Height  Weight
K1  185  72
K2  (170+168)/2 = 169  (56+60)/2 = 58
Example
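A quick check of this step (plain NumPy, illustration only): the new observation is nearest to cluster 2, and cluster 2's centroid becomes the mean of its members.

```python
import numpy as np

centers = np.array([[185, 72], [170, 56]])               # current centers K1, K2
obs = np.array([168, 60])                                # next observation
print(np.linalg.norm(obs - centers, axis=1).round(3))    # [20.809  4.472] -> nearest is cluster 2

cluster2_members = np.array([[170, 56], [168, 60]])      # members of cluster 2 so far
print(cluster2_members.mean(axis=0))                     # [169.  58.] updated centroid
```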
Next observation:
Height  Weight
179  68

Step 5: Repeat Step 4: calculate the Euclidean distance for the next observation, assign it based on minimum distance, and update the cluster centers, until all observations are assigned a cluster membership.
Euclidean distance from Cluster 1 = SQRT[(179-185)² + (68-72)²] = 7.21; Euclidean distance from Cluster 2 = SQRT[(179-169)² + (68-58)²] = 14.14; assigned to Cluster 1
• Since the distance is minimum from cluster 1, the observation is assigned to cluster 1.
• Now revise the cluster centroid: the mean value of the member observations' Height and Weight.
• The addition is only to cluster 1, so the centroid of cluster 1 is updated as follows:

Updated cluster centers:
Cluster  Height  Weight
K1  (185+179)/2 = 182  (72+68)/2 = 70
K2  169  58
Example
Step 6: Draw the cluster plot to see how the clusters are distributed. The less the overlap between clusters, the better the distribution and cluster assignments.

Final cluster centers:
Cluster  Height  Weight
K=1  182.8  72
K=2  169  58

Cluster plot (final assignments): silhouette = 0.8, indicating very good quality clusters.
The closer the silhouette score is to 1, the better the quality of the clusters.
Example
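For reference, here is a hedged end-to-end sketch of this worked example using scikit-learn (an assumption; scikit-learn's KMeans uses batch Lloyd-style updates rather than the one-observation-at-a-time updates above, so results may differ slightly from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# The Height/Weight data sample from the slides
X = np.array([[185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
              [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76]])

X_scaled = MinMaxScaler().fit_transform(X)                     # Step 2: min-max scaling
km = KMeans(n_clusters=2, max_iter=20, n_init=10, random_state=0).fit(X_scaled)

print(km.labels_)                                              # cluster membership per row
print(km.cluster_centers_)                                     # final centers (in scaled units)
print(round(silhouette_score(X_scaled, km.labels_), 2))        # partition quality, closer to 1 is better
```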
Standard Tuning Parameters
• Number of clusters (K):
  • The desired number of clusters
  • Suggested range: 3 to 5
  • The actual number could be smaller in the output if there are no divisible clusters in the data
  • This parameter input can be automated using the silhouette score (explained in later slides)
• Max iterations:
  • The maximum number of k-means iterations used to refine the clusters
  • By default this value should be set to 20
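For illustration, these two tuning parameters map directly onto the constructor arguments of a typical K-means implementation (scikit-learn shown here as an assumption; the deck's UI exposes them as "Number of clusters" and "Maximum iterations"):

```python
from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=3,     # Number of clusters (K), e.g. from the suggested range of 3 to 5
    max_iter=20,      # Maximum iterations, the default suggested above
    n_init=10,        # rerun with several random initializations (see Limitations)
    random_state=42,  # fix the seed so repeated runs give the same clusters
)
# model.fit(X_scaled) would then be called on the scaled input variables
```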
Sample UI For Input Variables & Parameters Selection & Output
Sample UI for selecting predictors and applying tuning parameters: For Two Predictors
Select the variables you would like to use as predictors to build clusters: Height, Weight, BMI
Tuning parameters: Number of clusters, Maximum iterations
• The silhouette score is another useful criterion for assessing the natural and optimal number of clusters, as well as for checking the overall quality of the partition
• The largest silhouette score, over different values of K, indicates the best number of clusters (see the sketch below)
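A minimal sketch (assuming scikit-learn, which the deck does not prescribe) of picking K by the largest silhouette score over a range of candidate values:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data, assumed already scaled

scores = {}
for k in range(2, 7):                                 # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)                  # K with the largest silhouette score
print(scores, "best K:", best_k)
```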
Height (cm)  Weight (kg)  BMI  Cluster number
158  60  23  1
160  65  25  2
170  70  26  2
149  50  21  1
180  80  27  3
165  80  28  3
200  90  23  1
Each customer is assigned a cluster membership as shown in the table on the left.
Cluster plot (Height vs. Weight): silhouette = 0.7, indicating good quality clusters.
Output UI: For Two Predictors
As the clusters are built using only 2 predictors here, the scatter plot axes reflect the actual predictors, i.e. Height and Weight, instead of principal components.
Again, the less the overlap in the cluster outlines, the better the cluster assignment.
Alternatively, the silhouette score can be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality.
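One possible way to draw the two-predictor cluster plot described above (matplotlib, an assumption; the deck's tool renders this plot itself):

```python
import matplotlib.pyplot as plt

def plot_two_predictor_clusters(height, weight, labels):
    # Color each point by its assigned cluster membership
    plt.scatter(height, weight, c=labels, cmap="viridis")
    plt.xlabel("Height (cm)")
    plt.ylabel("Weight (kg)")
    plt.title("K-means clusters on two predictors")
    plt.show()
```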
Sample UI for Selecting Predictors and Applying Tuning Parameters: For Four Predictors
Select the variables you would like to use as predictors to build clusters: Purchase amount, Purchase frequency, Total purchase quantity, Annual income, Website visits
Tuning parameters: Number of clusters, Maximum iterations
• The silhouette score is another useful criterion for assessing the natural and optimal number of clusters, as well as for checking the overall quality of the partition
• The largest silhouette score, over different values of K, indicates the best number of clusters
Customer ID  Purchase amount (per month)  Purchase frequency (per month)  Total purchase quantity  Annual income slab  Website visits (per month)  Cluster number
1   5k to 10k   2  6   2 lac to 4 lac    3 to 6   2
2   1k to 5k    1  4   1 lac to 2 lac    <3       1
3   10k to 15k  3  8   4 lac to 6 lac    6 to 10  3
4   5k to 10k   2  6   2 lac to 4 lac    3 to 6   2
5   10k to 15k  3  8   4 lac to 6 lac    6 to 10  3
6   1k to 5k    1  2   1 lac to 2 lac    <3       1
7   5k to 10k   2  6   2 lac to 4 lac    3 to 6   2
8   1k to 5k    1  2   1 lac to 2 lac    <3       1
9   1k to 5k    1  2   1 lac to 2 lac    <3       1
10  >20k        4  32  10 lac to 15 lac  >10      3
Each customer is assigned a cluster membership as shown in the table on the left.
Cluster plot (first principal component vs. second principal component)
Output UI: For Four Predictors
In the 2D cluster plot shown on the right, the cluster distribution is plotted.
In this case, the axes reflect the first two principal components (see the definition below) instead of the actual predictors, because the number of predictors is greater than 2. In the case of a 3D plot, the first three principal components would be shown. The less the overlap between clusters, the better the cluster assignment.
The silhouette score can also be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality.
Whenever there are more than 2 predictors as input, the axes reflect principal components instead of the actual predictors.
Principal components are linear combinations of the original predictors which capture the maximum variance in the data set.
Typically, most of the variance in the data is explained by the first two or three principal components, so the remaining components can be ignored for plotting.
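A hedged sketch (scikit-learn and matplotlib assumed, which the deck does not prescribe) of projecting clusters onto the first two principal components when there are more than two predictors:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(150, 4))      # placeholder data: 4 predictors

X_std = StandardScaler().fit_transform(X)               # standardize before clustering
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

pcs = PCA(n_components=2).fit_transform(X_std)          # first two principal components
plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, cmap="viridis")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()
```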
Other Sample Output Formats
silhouette = -0.5, indicating poor quality clusters
silhouette = 0.3, indicating average quality clusters
silhouette = 0.7, indicating good quality clusters
• Clusters with a silhouette score closer to 1 are more desirable, so this index can be used to measure the goodness of clusters.
Other Sample Output Formats
Limitations
• The number of clusters, k, must be determined beforehand. Ideally, the algorithm should auto-suggest this number for better user friendliness.
• It does not yield the same result with each run, since the resulting clusters depend on the initial random assignment of the group centers.
• If the data is input in a different order, it may produce different clusters when the number of data points is small; hence the number of data points must be large enough. It has been suggested that 2^m (where m = number of clustering variables) can be used as a rule of thumb for the sample data size.
• K-means is suitable only for numeric data.
• The scale of the data points influences the Euclidean distance, so variable standardization becomes necessary (see the sketch after this list).
• Empty clusters can be obtained if no points are allocated to a cluster during the assignment step.
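Because Euclidean distance is scale-sensitive, variables are usually rescaled before clustering. A minimal sketch (scikit-learn assumed; the column values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# e.g. [annual income, loan amount]: very different scales would otherwise dominate the distance
X = np.array([[105000.0, 21000.0],
              [92000.0, 15000.0],
              [45000.0, 10000.0]])

X_minmax = MinMaxScaler().fit_transform(X)    # (x - min) / (max - min), as in Step 2
X_zscore = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1 per column
```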
Applications & Business Use Cases
General applications of K-means Clustering
• Some examples of use cases are:
  • Behavioral segmentation:
    • Segment customers by purchase history
    • Segment users by activities on an application, website, or platform
    • Define personas based on interests
    • Create consumer profiles based on activity monitoring
• Some other general applications:
  • Pattern recognition, market segmentation, classification analysis, artificial intelligence, image processing, astronomy, agriculture, and many others.
Use case 1
• Business problem:
  • Grouping loan applicants into high/medium/low risk applicants based on attributes such as loan amount, monthly installment, employment tenure, times delinquent, annual income, debt-to-income ratio, etc.
• Business benefit:
  • Once segments are identified, the bank will have a loan applicants' dataset with each applicant labeled as high/medium/low risk.
  • Based on these labels, the bank can easily decide whether or not to give a loan to an applicant, and if yes, what credit limit and interest rate each applicant is eligible for, based on the amount of risk involved.
Use case 1
Input dataset:
Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure
1039153  21000  701.73  105000  9   5  4
1069697  15000  483.38  92000   11  5  2
1068120  25600  824.96  110000  10  9  2
563175   23000  534.94  80000   9   2  12
562842   19750  483.65  57228   11  3  21
562681   25000  571.78  113000  10  0  9
562404   21250  471.2   31008   12  1  12
700159   14400  448.99  82000   20  6  6
696484   10000  241.33  45000   18  8  2
702598   11700  381.61  45192   20  7  3
702470   10000  243.29  38000   17  9  7
702373   4800   144.77  54000   19  8  2
701975   12500  455.81  43560   15  8  4
Use case 1
Output: Each record will have the cluster (segment) assignment as shown below:
Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Cluster
1039153  21000  701.73  105000  9   5  4   Medium
1069697  15000  483.38  92000   11  5  2   Medium
1068120  25600  824.96  110000  10  9  2   Medium
563175   23000  534.94  80000   9   2  12  Low
562842   19750  483.65  57228   11  3  21  Low
562681   25000  571.78  113000  10  0  9   Low
562404   21250  471.2   31008   12  1  12  Low
700159   14400  448.99  82000   20  6  6   High
696484   10000  241.33  45000   18  8  2   High
702598   11700  381.61  45192   20  7  3   High
702470   10000  243.29  38000   17  9  7   High
702373   4800   144.77  54000   19  8  2   High
701975   12500  455.81  43560   15  8  4   High
Use case 1
Output: Cluster profiles:
Cluster  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Risk segment
1  10447.30  304.87  66467.74  9.58   1.69  16.82  Low
2  21391.58  598.54  94912.59  12.37  5.98  4.58   Medium
3  7521.32   227.43  60935.28  16.55  6.91  4.01   High

As can be seen in the table above, the high, medium, and low risk segments have distinctive characteristics.
The high risk segment has the highest likelihood of being delinquent, the highest debt-to-income ratio, and the lowest employment tenure compared to the other two segments.
The low risk segment exhibits exactly the opposite pattern, i.e. the lowest debt-to-income ratio, the lowest delinquency, and the highest employment tenure compared to the other two segments.
Hence, delinquency, employment tenure, and debt-to-income ratio are the determining factors when it comes to segmenting loan applicants.
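A sketch (pandas assumed; the column names mirror the table above but the helper itself is an assumption of this write-up) of how such a cluster-profile table can be produced: average each attribute within each assigned cluster.

```python
import pandas as pd

def cluster_profiles(df: pd.DataFrame) -> pd.DataFrame:
    # df holds the loan applicant records plus the assigned "Cluster" column
    numeric_cols = ["Loan amount", "Monthly installment", "Annual income",
                    "Debt to income ratio", "Times delinquent", "Employment tenure"]
    # Mean of each attribute per cluster, rounded as in the profile table
    return df.groupby("Cluster")[numeric_cols].mean().round(2)
```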
Use case 1
Output: Cluster distribution:
In the cluster distribution plot, there is negligible overlap in the cluster outlines, so we can say that the cluster assignment is good in our case.
Clusters with an average silhouette width closer to 1 are more desirable, so this index can be used to test the quality of the cluster distribution.
silhouette = 0.6, indicating good quality clusters
Use case 2
Business problem:
• Organizing customers into groups/segments based on similar traits, product preferences, and expectations
• Segments are constructed on the basis of customers' demographic characteristics, psychographics, past behavior, and product use behaviors
Business benefit:
• Once segments are identified, marketing messages and even products can be customized for each segment.
• The better the segment(s) chosen for targeting by a particular organization, the more successful it is assumed to be in the marketplace.
Use case 3
Business problem:
• Discount analysis and customer retention: visualize 'segments of the sales group based on discount behavior' and 'customer churn - segments of customers on the verge of leaving'
Business benefit:
• The business marketing team can focus on risky customer segments in an efficient way in order to keep them from churning/leaving
• Sales team segments which are facing challenges with the current discounting strategy can be identified, and the deal negotiation strategy can be improved/optimized for them.
Want to Learn More?
Get in touch with us @ support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018