Principal Component Analysis and Clustering
Professor Daymond
27-Nov-2016
UNDERSTANDING BORROWER SEGMENTS
• Credit based accounts: The majority of the accounts belong to credit-based borrowers, characterised by their revolving utilization and by having the most revolving accounts and bankcards.
• Fixed instalment accounts: These accounts mostly carry fixed instalments, such as car loans and student loans. The number of instalment accounts and instalment utilization are the major factors of this segment.
• Past due accounts: These are borrowers with past-due records and the most late fees on credit and loan amounts. Given their recent history of delinquency, this segment is medium risk.
• Highly inquired accounts: These are borrowers with the most loan inquiries, who exhibit the most credit card purchase behaviour and attempt to try all possible loans.
• Debt collections accounts: Debt-to-collection accounts hold the largest number of public records, such as tax liens. Collections money owed and tax liens are the major factors of this segment.
• High risk delinquent accounts: The highest delinquency, usage exceeding the credit limit, and multiple recently opened accounts make this segment high risk.
IDENTIFYING THE PRINCIPAL COMPONENTS
With the given dataset (N = 27,000 observations, 77 variables), it is important to reduce the data to a smaller set of variables in order to reach a workable conclusion. Because of multicollinearity, two or more variables can lie close to the same direction in the n-dimensional variable space. Each row of the data can be envisioned as a point in 77-dimensional space; when the data are projected onto orthonormal axes, observations with similar characteristics are expected to group along a few of those axes, the principal components. To identify these principal components, PROC PRINCOMP is run on all variables except the constant ones (recoveries and collection fees), and the eigenvalues of all principal components are plotted.
The eigenvalue of each principal component is the variance that component explains: the larger the eigenvalue, the more variance the component accounts for. The cut-off criteria are therefore that a component's eigenvalue must be greater than 1 and that the retained components must together explain at least 75% of the variance.
From the results (Appendix 1), 18 components have eigenvalues greater than 1 and together contribute approximately 76% of the total variance. The coefficients of the principal components are the eigenvectors (Appendix 2): each defines a linear combination of the inputs and gives the direction of a component axis, while the corresponding eigenvalue gives its length.
From the scree plot in Figure 1, the curve is almost flat once the eigenvalues fall below 1, implying that further components contribute very little to the variance. Hence a total of 18 principal components capture a significant share of the variance in the data.
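The cut-off rule described above (eigenvalue greater than 1, cumulative variance of at least 75%) can be sketched in a few lines of Python. This is only an illustration and not the PROC PRINCOMP output: the function name `select_components`, its parameters, and the toy eigenvalues are assumptions made for the example.

```python
def select_components(eigenvalues, min_eigenvalue=1.0, min_cumulative=0.75):
    """Apply the two cut-off criteria to a list of eigenvalues.

    Returns (number of components kept, share of variance they explain,
    whether that share reaches min_cumulative).
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)  # total variance = sum of all eigenvalues
    kept = [ev for ev in ordered if ev > min_eigenvalue]
    cumulative = sum(kept) / total
    return len(kept), cumulative, cumulative >= min_cumulative


# Toy eigenvalues for a hypothetical 5-variable correlation matrix
# (they sum to 5, the number of standardized variables).
toy_eigenvalues = [2.5, 1.5, 0.5, 0.25, 0.25]
n_kept, share, meets_target = select_components(toy_eigenvalues)
print(n_kept, share, meets_target)  # 2 0.8 True
```

In the toy case two components pass the eigenvalue test and explain 80% of the variance, so both criteria hold. When the second criterion fails, the analyst has to decide whether to admit components with eigenvalues below 1.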
Figure 1 Figure 2
INTERPRETING THE PRINCIPAL COMPONENTS
To interpret the principal components, the eigenvector matrix is examined for the coefficients that show the strongest association with the original variables. PRINCOMP standardizes the data (it analyses the correlation matrix by default), so the coefficients have absolute values below 1. Values close to 1 in absolute terms, whether positive or negative, indicate a strong association with the original variable.
PRINCIPAL COMPONENT 1
From Figure 4, the highest coefficients correspond to the
various counts of accounts, i.e. how valuable the customers
are in terms of usage, and the lowest correspond to the
time since the most recent account, i.e. how credible the
customers are.
Similarly, each principal component is analysed for its
highest and lowest coefficients, and the results are
tabulated for reference.
Figure 3
Figure 4
IDENTIFYING THE CLUSTERS
Once the principal components are identified, the next step is to cluster on them: after PROC STDIZE, the FASTCLUS procedure is run with MAXCLUSTERS values ranging from 3 to 20. FASTCLUS uses k-means clustering, an iterative approach that helps to identify approximately equal-sized clusters with a reasonable spread. A set of observations is selected as initial seeds, which serve as the starting cluster means. Each observation is assigned to its nearest seed to form temporary clusters, each seed is replaced with the mean of its new cluster, and this is repeated until the clusters no longer change. ‘Complete convergence is satisfied’ implies that the final seeds are equal to the cluster means.
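The iteration described above can be sketched as a plain k-means loop. This is a minimal illustration in Python, not FASTCLUS itself, which selects its seeds and handles convergence in its own way. The function `kmeans`, its parameters, and the toy points are assumptions made for the example.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Assign each point to its nearest seed, move each seed to the mean
    of its cluster, and repeat until no point changes cluster."""
    centers = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # clusters unchanged, so the seeds already equal the cluster means
        assignment = new_assignment
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous seed
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Two well-separated toy groups; the first point of each group is used as a seed.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = kmeans(points, seeds=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Running the loop for each number of clusters from 3 to 20 and comparing the resulting cluster sizes would mirror the search described in the text.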
Summary
The cluster summary displays the frequency of observations in each
cluster and the root mean square deviation. The next column displays the largest
distance from the seed to an observation, i.e. approximately the total spread
of the cluster. The last column displays the distance from the centre of the cluster
to the centre of the nearest cluster.
Six appropriately sized clusters are obtained from the 14-cluster run, at the 35th iteration.
Clusters 1, 4, 6, 9, 12 and 14 are the identified clusters, and Cluster 1 is observed to be
the nearest cluster to all of the others.
Goodness-of-fit metrics
Higher values of the pseudo F statistic are preferred when choosing a good number of
clusters.
R-square is the proportion of variance accounted for by the clusters.
Higher CCC (cubic clustering criterion) values indicate good clustering; values are
generally expected to be above 2 or 3.
A higher pseudo F statistic and CCC together imply that the clustering solution is good.
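Two of these metrics have simple closed forms, sketched below in Python under stated assumptions. R-square is taken as the between-cluster sum of squares divided by the total sum of squares, and the pseudo F statistic as (between SS / (k - 1)) / (within SS / (n - k)) for n observations in k clusters. The function `cluster_fit` and the toy data are hypothetical. CCC is left out because its formula is considerably more involved.

```python
def cluster_fit(points, labels):
    """Return (R-square, pseudo F) for a clustering of numeric points.

    Assumes at least two clusters and a non-zero within-cluster sum of squares.
    """
    n = len(points)
    dims = range(len(points[0]))
    clusters = sorted(set(labels))
    k = len(clusters)

    grand_mean = [sum(p[d] for p in points) / n for d in dims]
    total_ss = sum((p[d] - grand_mean[d]) ** 2 for p in points for d in dims)

    within_ss = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        mean = [sum(p[d] for p in members) / len(members) for d in dims]
        within_ss += sum((p[d] - mean[d]) ** 2 for p in members for d in dims)

    between_ss = total_ss - within_ss
    r_square = between_ss / total_ss
    pseudo_f = (between_ss / (k - 1)) / (within_ss / (n - k))
    return r_square, pseudo_f


# Toy data: two tight clusters far apart give a high R-square and pseudo F.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = [0, 0, 1, 1]
r_square, pseudo_f = cluster_fit(toy_points, toy_labels)
print(round(r_square, 4), pseudo_f)  # 0.9615 50.0
```

Comparing these values across candidate numbers of clusters is the kind of check the metrics above are meant to support.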
IDENTIFYING THE CLUSTERS
Cluster means and standard deviations of the variables are displayed as part of the FASTCLUS output. As with the principal components, each cluster is analysed for its highest and lowest values in order to understand the relation between the principal components and the cluster segments.
Figure 4
The clusters are analysed and interpreted in terms of the loan data
variables. Figure 4 displays the customer segments identified after
the analysis of the coefficient matrix. These are the major segments
of the loan data:
• Credit based – revolving accounts
• Fixed instalment based loan accounts
• Accounts that are mostly past due on credit and late fees
• Accounts that are highly inquired
• Accounts that exceed the 75% threshold and open many new accounts
Further, PROC UNIVARIATE is run on the new cluster dataset,
and its output is approximately consistent with the box plots.
This suggests that the segments are approximately correct.
Figure 5 Boxplot of instalment accounts over all clusters
Figure 6 Boxplot of Percentage greater than 75 over all clusters
SCORING THE NEW DATA
The new data is then scored with the statistics from the old data, and the segments are identified. Scoring the new dataset consists of the following steps:
• The output statistics from PRINCOMP are used to score the new dataset
• The output from STDIZE is used as input to standardize the newly scored dataset
• The output statistics from FASTCLUS are used as the input statistics for assigning the new dataset to clusters
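The three steps above can be sketched for a single observation as follows. This is a Python illustration under stated assumptions, not the SAS scoring code. All parameter names (`var_mean`, `var_std`, `eigenvectors`, `score_mean`, `score_std`, `centers`) are hypothetical stand-ins for the statistics saved from the old data, and the numbers are toy values.

```python
import math


def score_row(row, var_mean, var_std, eigenvectors, score_mean, score_std, centers):
    """Assign one new observation to a cluster using statistics from the old data."""
    # Step 1 (PRINCOMP statistics): standardize the raw variables with the old
    # means and standard deviations, then project onto the old eigenvectors.
    z = [(x - m) / s for x, m, s in zip(row, var_mean, var_std)]
    scores = [sum(zi * w for zi, w in zip(z, vec)) for vec in eigenvectors]
    # Step 2 (STDIZE statistics): standardize the component scores with the
    # location and scale that were estimated on the old scores.
    std_scores = [(v - m) / s for v, m, s in zip(scores, score_mean, score_std)]
    # Step 3 (FASTCLUS statistics): pick the nearest of the old cluster centres.
    return min(range(len(centers)), key=lambda j: math.dist(std_scores, centers[j]))


# Toy statistics: two variables, one retained component, two clusters.
old_stats = dict(
    var_mean=[0.0, 0.0],
    var_std=[1.0, 1.0],
    eigenvectors=[[0.6, 0.8]],  # unit-length eigenvector
    score_mean=[0.0],
    score_std=[2.0],
    centers=[[-1.0], [1.0]],
)
print(score_row((3.0, 4.0), **old_stats))    # 1
print(score_row((-3.0, -4.0), **old_stats))  # 0
```

The point of the design is that nothing is re-estimated on the new data: every mean, scale, eigenvector and cluster centre comes from the old run, so old and new observations are assigned on the same basis.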
Figure 7 displays the frequency distribution of the means across the old and new datasets for comparison. The clusters are
approximately the same, which suggests that the segments have been identified correctly.
OLD DATA
NEW DATA
LEARNINGS
Identifying the principal components is complex; clustering them
gives a much clearer picture.
With very little business knowledge, identifying the clusters and verifying
the segments was difficult.
Learnt how to write a macro to run the clustering for 3 to 20 clusters and then identify the
best solution from the batch.
Using UNIVARIATE was a revelation when my segments matched the box
plots, even though I am not sure whether the segments are correct as such.
APPENDIX 1 – EIGENVALUES WHEN CURVE CHANGES
APPENDIX 2 – EIGENVECTORS OF FIRST 10 PRINCIPAL COMPONENTS

 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 

Principal Component Analysis and Clustering

  • 1. Principal Component Analysis and Clustering Professor Daymond 27-Nov-2016
  • 2. UNDERSTANDING BORROWER SEGMENTS
      • Credit based accounts: The majority of the accounts belong to credit-based borrowers, who have the most revolving accounts and bankcards and are characterised by their revolving utilization.
      • Fixed instalment accounts: These accounts mostly carry fixed instalments such as car loans and student loans. The number of instalment accounts and instalment utilization are the major factors of this segment.
      • Past due accounts: These borrowers have past-due records and account for most of the late fees on credit and loan amounts. With a recent history of delinquency, this segment is medium risk.
      • Highly inquired accounts: These borrowers have the most loan inquiries, exhibit the most credit card purchase behaviour, and attempt to obtain every possible loan.
      • Debt collections accounts: These accounts hold the largest number of public records, such as tax liens. Collections money owed and tax liens are the major factors of this segment.
      • High risk delinquent accounts: The highest delinquency, usage beyond the credit limit, and multiple recently opened accounts make this segment high risk.
  • 3. IDENTIFYING THE PRINCIPAL COMPONENTS With the given dataset (N = 27,000) and 77 variables, it is important to reduce the data to a smaller set of variables in order to draw a feasible conclusion. Because of multicollinearity, two or more variables can lie in nearly the same plane of the n-dimensional space. Each row of the data can be envisioned as a point in 77 dimensions; when the data are projected onto orthonormal axes, variables that share characteristics are expected to group together as principal components. To identify these components, PROC PRINCOMP is executed with all the variables except the constant ones (recoveries and collection fees), and a plot of the eigenvalues of all the principal components is produced. The eigenvalue of each principal component is the variance of that component: the greater the eigenvalue, the more of the total variance the component explains. The cut-off criteria are therefore that the eigenvalue must be greater than 1 and that the cumulative variance should be at least 75%. The results (Appendix 1) show 18 components with eigenvalues greater than 1, which together contribute approximately 76% of the total variance. The coefficients of the principal components are the eigenvectors (Appendix 2): each one holds the weights of a linear combination of the inputs and gives the direction of that component's axis, while the eigenvalue gives the variance along it. The scree plot in Figure 1 is almost flat once the eigenvalue falls to 1, implying that further components contribute very little to the variance. Hence a total of 18 principal components capture a significant share of the variance in the data. Figure 1 Figure 2
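The retention rule on this slide (eigenvalue greater than 1, cumulative variance of at least 75%) can be sketched in a few lines. This is a minimal illustration in Python rather than the SAS used in the deck; the function name `select_components` and the eigenvalues in the usage note are hypothetical, not values from Appendix 1.

```python
def select_components(eigenvalues, min_eigenvalue=1.0, min_cumulative=0.75):
    """Apply the two cut-off criteria to a list of eigenvalues.

    Returns a tuple of:
      - how many components have an eigenvalue above min_eigenvalue,
      - how many leading components are needed to reach min_cumulative,
      - the share of total variance explained by the first group.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)

    # Criterion 1: eigenvalue greater than the threshold.
    above_threshold = sum(1 for ev in ordered if ev > min_eigenvalue)

    # Criterion 2: smallest number of components reaching the target share.
    needed = len(ordered)
    cumulative = 0.0
    for count, ev in enumerate(ordered, start=1):
        cumulative += ev
        if cumulative / total >= min_cumulative:
            needed = count
            break

    explained = sum(ordered[:above_threshold]) / total
    return above_threshold, needed, explained


# Hypothetical eigenvalues for a five-variable problem.
kept, needed, explained = select_components([4.0, 2.0, 1.5, 0.25, 0.25])
```

On the deck's data, the reported outcome corresponds to a first value of 18 and an explained share of roughly 0.76.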
  • 4. INTERPRETING THE PRINCIPAL COMPONENTS To interpret the principal components, the matrix of eigenvector coefficients is examined for the strongest associations with the original variables. PRINCOMP standardizes the data (by default it works from the correlation matrix), so the values in this matrix are less than 1 in absolute value. Values closer to an absolute value of 1, whether positive or negative, indicate a high correlation with the original variables. PRINCIPAL COMPONENT 1: Figure 4 shows that the highest coefficients belong to the variables counting the various numbers of accounts (how valuable the customers are in terms of usage), and the lowest belong to the time since the most recent account (how credible the customers are). Similarly, each principal component is analysed for its highest and lowest coefficients, and the results are tabulated for reference. Figure 3 Figure 4
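The relation between eigenvector coefficients and correlations can be written out explicitly. The sketch below assumes the analysis is run on standardized variables $z_j$ (the correlation-matrix case); the symbols $v_{jk}$ (coefficient of variable $j$ in component $k$), $\lambda_k$ (eigenvalue of component $k$) and $p$ (number of variables) are introduced here for illustration and do not appear in the slides.

```latex
\mathrm{PC}_k = \sum_{j=1}^{p} v_{jk}\, z_j, \qquad
\operatorname{Var}(\mathrm{PC}_k) = \lambda_k, \qquad
\operatorname{corr}(z_j, \mathrm{PC}_k) = v_{jk}\sqrt{\lambda_k}
```

Because each eigenvector has unit length, $|v_{jk}| \le 1$. Within a single component, ranking the variables by coefficient or by correlation gives the same order, since $\sqrt{\lambda_k}$ is a common factor.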
  • 5. IDENTIFYING THE CLUSTERS Once the principal components are identified, the next step is to standardize them with PROC STDIZE and feed them to the FASTCLUS procedure, which is run with MAXCLUSTERS values ranging from 3 to 20. FASTCLUS uses k-means clustering, an iterative approach that helps to identify approximately equal-sized clusters with a decent spread. A set of observations is selected as the initial seeds (the reference means); each observation is assigned to its nearest seed to form temporary clusters; the seeds are replaced with the means of the new clusters; and this is repeated until the clusters no longer change. ‘Complete convergence is satisfied’ implies that the final seeds are equal to the cluster means. Summary: The cluster summary statistics display the frequency of observations in each cluster and the root mean square deviation. The next column displays the largest distance from the seed to an observation, i.e. approximately the total spread of the cluster. The last column displays the distance from the centre of the cluster to the centre of the nearest cluster. Six appropriately sized clusters are obtained from the 14-cluster solution, at the 35th iteration. Clusters 1, 4, 6, 9, 12 and 14 are the identified clusters, and Cluster 1 is observed to be the nearest cluster to all the others. Goodness-of-fit metrics: Higher values of the pseudo F statistic are preferred when choosing the number of clusters. R-square measures the proportion of variance accounted for by the clusters. Higher CCC (cubic clustering criterion) values indicate good clustering; values above 2 or 3 are generally expected. A higher F statistic and a higher CCC together imply that the clustering solution is good.
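The iterative loop described on this slide, together with the pseudo F statistic, can be sketched as follows. This is a plain k-means illustration in Python, not a reimplementation of FASTCLUS (whose seed selection and update options differ); the names `kmeans` and `pseudo_f` and the toy data are hypothetical, and the initial seeds are supplied by the caller.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, initial_seeds, max_iter=100):
    """Assign points to the nearest seed, move seeds to cluster means, repeat."""
    seeds = [list(s) for s in initial_seeds]
    k = len(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, seeds[c]))
                      for p in points]
        if new_labels == labels:
            break  # no observation changed cluster: seeds equal cluster means
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous seed
                seeds[c] = mean(members)
    return seeds, labels


def pseudo_f(points, seeds, labels):
    """Pseudo F: (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n, k = len(points), len(seeds)
    grand_mean = mean(points)
    within = sum(sq_dist(p, seeds[lab]) for p, lab in zip(points, labels))
    total = sum(sq_dist(p, grand_mean) for p in points)
    return ((total - within) / (k - 1)) / (within / (n - k))


# Toy data: two well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
seeds, labels = kmeans(data, [data[0], data[4]])
fit = pseudo_f(data, seeds, labels)
```

Running `kmeans` once per candidate cluster count and comparing the `pseudo_f` values mirrors the idea of trying 3 to 20 clusters and keeping the best solution. The result depends on the initial seeds, so a poor choice can leave the loop in a weaker local solution.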
  • 6. IDENTIFYING THE CLUSTERS Cluster means and standard deviations of the variables are displayed as part of the FASTCLUS output. As with the principal components, each cluster is analysed for its highest and lowest coefficients in order to understand the relation between the principal components and the cluster segments. Figure 4 The clusters are analysed and described in terms of the loan data variables. Figure 4 displays the customer segments identified after analysing the coefficient matrix. These are the major segments of the loan data:
      • Credit based: revolving accounts
      • Fixed instalment based loan accounts
      • Accounts that are mostly past due on credit and late fees
      • Accounts that are highly inquired
      • Accounts that score high on the "percentage greater than 75" variable and open many new accounts
    Further, PROC UNIVARIATE is executed on the new cluster dataset, and its output approximately agrees with the segments when viewed as box plots. Hence the segments are taken to be approximately correct. Figure 5: Boxplot of instalment accounts over all clusters. Figure 6: Boxplot of percentage greater than 75 over all clusters.
  • 7. SCORING THE NEW DATA The new data is then scored with the statistics from the old data, and the segments are identified. Scoring the new dataset consists of the following steps:
      • The output statistics from PRINCOMP are used to score the new dataset
      • The output from STDIZE is used to standardize the newly scored dataset
      • The output statistics from FASTCLUS are used as the input statistics for assigning clusters in the new dataset
    Figure 7 displays the frequency distribution of the cluster means across the new and old datasets for comparison. The clusters are approximately the same, which indicates that the segments have been identified consistently. OLD DATA NEW DATA
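The three scoring steps can be sketched as one function applied to each new observation. This is an illustrative Python version under stated assumptions, not the SAS code behind the slides: the function name, argument names and numbers are hypothetical, and the stored statistics (variable means and standard deviations, eigenvectors, component means and standard deviations, final seeds) are assumed to have been saved from the run on the old data.

```python
def score_observation(row, var_means, var_stds, eigenvectors,
                      pc_means, pc_stds, seeds):
    """Return the 0-based index of the cluster a new observation falls into."""
    # Step 1 (PRINCOMP statistics): standardize the raw variables with the
    # old data's means and standard deviations, then project onto each
    # stored eigenvector to get the principal component scores.
    z = [(x - m) / s for x, m, s in zip(row, var_means, var_stds)]
    scores = [sum(w * v for w, v in zip(vec, z)) for vec in eigenvectors]

    # Step 2 (STDIZE statistics): standardize the component scores with the
    # location and scale values saved from the old data.
    std_scores = [(p - m) / s for p, m, s in zip(scores, pc_means, pc_stds)]

    # Step 3 (FASTCLUS statistics): assign the observation to the nearest
    # stored seed, without re-estimating any cluster.
    distances = [sum((a - b) ** 2 for a, b in zip(std_scores, seed))
                 for seed in seeds]
    return distances.index(min(distances))


# Hypothetical two-variable example with two stored seeds.
cluster = score_observation(
    row=[14.0, 28.0],
    var_means=[10.0, 20.0], var_stds=[2.0, 4.0],
    eigenvectors=[[0.6, 0.8], [0.8, -0.6]],
    pc_means=[0.0, 0.0], pc_stds=[1.4, 0.4],
    seeds=[[0.0, 0.0], [2.0, 1.0]],
)
```

The key design point is that nothing is refitted on the new data: every statistic comes from the old run, which is what makes the old and new cluster distributions comparable.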
  • 8. LEARNINGS Identifying the principal components is complex, and clustering them afterwards gives a much clearer picture. With very little business knowledge, identifying the clusters and verifying the segments was difficult. Learnt how to write a macro to run the clustering for 3 to 20 clusters and then identify the best solution from the batch. Using UNIVARIATE was a revelation when my segments matched the box plots, even though I am not sure whether the segments are correct as such.