Data Analysis Project Presentation: Bank Customer Segmentation
The Boston Institute of Analytics (BIA) presents a collection of student presentations on data analysis projects focused on bank customer segmentation.
Visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. INTRODUCTION
I got this dataset from the Kaggle website. It contains bank transaction records.
Most banks have a large customer base, with customers who differ in age, income, values, lifestyle, and more.
Customer segmentation is the process of dividing a customer
dataset into specific groups based on shared traits.
This process allows financial institutions to better understand
their customers and tailor their products, services, and
marketing strategies to meet the unique requirements of each
segment.
Customer understanding should be a living, breathing part of
everyday business, with insights underpinning the full range of
banking operations.
3. CONTENT
Importing Libraries
Dataset Features
EDA (Exploratory Data Analysis)
Visualization
Manipulating Data
Dealing with “Null” Values
Encoding the Categorical Data
KMeans
DBSCAN
Conclusion
4. IMPORTING LIBRARIES
We will be using the following libraries:
Pandas: useful for data processing and analysis.
Pandas DataFrame: a two-dimensional tabular data structure with labeled axes (rows and columns).
Seaborn: useful for data visualization.
NumPy: a Python library for working with arrays.
Matplotlib.pyplot: useful for making plots.
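A minimal import cell for the stack listed above might look like this (the aliases `pd`, `np`, `sns`, and `plt` are the conventional ones, not names mandated by the project):

```python
# Core libraries used throughout the analysis
import numpy as np               # array operations
import pandas as pd              # data processing and analysis (DataFrame)
import seaborn as sns            # statistical data visualization
import matplotlib.pyplot as plt  # plotting
```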
6. EDA (EXPLORATORY DATA ANALYSIS)
Exploratory Data Analysis (EDA) is a crucial phase in the data analysis process, where analysts and data
scientists examine and summarize the main characteristics of a dataset.
EDA plays a pivotal role in hypothesis generation, data cleaning, and guiding the selection of appropriate
modeling techniques, ultimately facilitating more informed and effective decision-making processes based
on a solid understanding of the data at hand.
7. As we can see, there are some null values in "CustomerDOB", "CustGender", and
"CustAccountBalance". We will treat them later.
Then we use the describe function, which gives the count, mean, minimum,
maximum, and other summary statistics for each numeric column.
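These two checks can be sketched as follows. The column names come from the slides, but since the Kaggle data itself is not reproduced here, the frame below uses made-up values purely for illustration:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the dataset's columns (values are made up)
df = pd.DataFrame({
    "CustomerDOB": ["1/1/1990", None, "4/5/1985"],
    "CustGender": ["M", "F", None],
    "CustAccountBalance": [1500.0, None, 320.5],
    "TransactionAmount": [120.0, 45.0, 980.0],
})

null_counts = df.isnull().sum()  # null count per column
summary = df.describe()          # count, mean, min, max, quartiles (numeric columns)
print(null_counts)
print(summary)
```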
8. VISUALIZATION
Seaborn: useful for making plots.
1. Heat map (correlation matrix): with the help of a heat map we can see the correlation between each pair of numeric columns in the dataset.
2. Histplot: this type of plot displays the distribution of a dataset by dividing it into bins and representing the frequency of data points within each bin with bars, providing insight into the underlying distribution.
3. As the histplot of customer gender shows, there are more male customers than female customers.
9. MANIPULATING DATA
Manipulating data involves transforming, cleaning or organizing information within a dataset to extract
meaningful insights.
There is a column "TransactionDate"; I changed its type to datetime.
With the help of this column I created three new columns: "transaction_year", "transaction_month", and
"transaction_day".
After all this, I dropped the columns that are not useful for the machine
learning model.
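A sketch of these steps follows. The date format and the dropped column name (`TransactionID`) are assumptions for illustration; the slides do not specify which columns were dropped:

```python
import pandas as pd

df = pd.DataFrame({
    "TransactionDate": ["2/8/16", "14/8/16", "1/9/16"],  # day-first strings (assumed format)
    "TransactionAmount": [250.0, 90.0, 1200.0],
    "TransactionID": ["T1", "T2", "T3"],
})

# Convert the column to datetime, then derive year/month/day features
df["TransactionDate"] = pd.to_datetime(df["TransactionDate"], dayfirst=True)
df["transaction_year"] = df["TransactionDate"].dt.year
df["transaction_month"] = df["TransactionDate"].dt.month
df["transaction_day"] = df["TransactionDate"].dt.day

# Drop columns that carry no signal for the clustering model
df = df.drop(columns=["TransactionID", "TransactionDate"])
```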
10. DEALING WITH “NULL” VALUES
As we saw in EDA there are some null values in “CustAccountBalance” and “CustGender”.
I filled the "CustAccountBalance" null values with 0, because account balance is a very sensitive field
in transaction data and we can't just fill it with assumptions; that would mislead us.
"CustGender" is a categorical column, so its null values can't be filled with a mean or median.
They can only be filled with the mode (most frequent value) of that column.
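Both imputations can be written in two lines of pandas; the frame below is a made-up example with the slide's column names:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "CustAccountBalance": [1500.0, np.nan, 320.5, np.nan],
    "CustGender": ["M", "F", None, "M"],
})

# Balance is too sensitive to impute from assumptions, so fill with 0
df["CustAccountBalance"] = df["CustAccountBalance"].fillna(0)

# Gender is categorical: use the mode (most frequent value), not mean/median
df["CustGender"] = df["CustGender"].fillna(df["CustGender"].mode()[0])
```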
11. ENCODING THE CATEGORICAL DATA
The process of converting categorical data into numerical form is called "categorical encoding".
There are a few methods of categorical encoding, such as label encoding and one-hot encoding.
I chose label encoding instead of one-hot encoding, because one-hot encoding would make the data too complicated here.
After dropping some columns, there are only two categorical columns left to encode into numeric form: "CustGender" and "CustLocation".
This is how our data looks after all the preprocessing and encoding of the categorical data.
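Assuming scikit-learn is available, label encoding both columns is a short loop; the location values below are made up, and note that `LabelEncoder` assigns codes in sorted order of the unique values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "CustGender": ["M", "F", "M", "F"],
    "CustLocation": ["MUMBAI", "DELHI", "MUMBAI", "BANGALORE"],
})

# Label-encode each categorical column in place (codes follow sorted unique values)
for col in ["CustGender", "CustLocation"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```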
12. KMEANS
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a
dataset into a set of distinct, non-overlapping subgroups or clusters.
The primary goal of K-means is to group similar data points together and assign them to clusters based
on certain features or attributes.
Deciding the number of clusters is one of the critical parts of the KMeans algorithm.
There is a method for choosing the number of clusters called the Elbow Method.
Elbow Method: it involves plotting the Within-Cluster Sum of Squares (WCSS) against different values of k and identifying the "elbow point", where the reduction in WCSS starts to slow down.
For this dataset, the elbow method suggests 2 clusters, which simply split the customers by gender, "Male" and "Female".
That split is not very helpful and does not make much business sense.
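The elbow method can be sketched as below; `X` here is a synthetic three-blob stand-in for the preprocessed feature matrix, and the k range (1 to 10) is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for the preprocessed feature matrix
X = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 5, 10)])

# Elbow method: fit KMeans for k = 1..10 and record WCSS (inertia_)
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

# WCSS drops quickly until the true cluster count, then flattens: the "elbow"
```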
13. After observing and studying the dataset, I found that there are twenty unique locations in the
customer location column.
So I decided to make 20 clusters, because that makes more sense for the machine learning model.
After making the twenty clusters, I checked the Silhouette Score metric.
This metric is used to assess the quality of the clusters produced by a clustering method.
The silhouette score for this algorithm is 69.83%, which is a decent score.
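Fitting KMeans and scoring it follows the pattern below. The data is again a synthetic stand-in, with 5 clusters instead of the project's 20 so the toy example stays small; the actual score depends entirely on the real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in; on the real data X would be the encoded feature matrix
X = np.vstack([rng.normal(loc, 0.5, size=(60, 2)) for loc in (0, 5, 10, 15, 20)])

n_clusters = 5  # the slides use 20, one per unique CustLocation
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

# Silhouette score ranges from -1 to 1; higher means better-separated clusters
score = silhouette_score(X, labels)
```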
14. DBSCAN
DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is a popular unsupervised
machine learning algorithm used for clustering spatial data points based on their density distribution.
Unlike K-means, DBSCAN does not require specifying the number of clusters in advance. Instead, it
defines clusters as dense regions separated by areas of lower point density.
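A minimal DBSCAN sketch on synthetic data is shown below. The `eps` and `min_samples` values must be tuned per dataset and are pure assumptions here; points that belong to no dense region get the label -1 (noise):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense blobs plus a few scattered outliers (illustrative data)
X = np.vstack([
    rng.normal(0, 0.3, size=(50, 2)),
    rng.normal(5, 0.3, size=(50, 2)),
    rng.uniform(-10, 15, size=(5, 2)),
])

# eps and min_samples are assumed values; they must be tuned per dataset
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

n_clusters = len(set(labels) - {-1})  # -1 marks noise points
```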
15. CONCLUSION
The KMeans algorithm performs better than DBSCAN (Density-Based Spatial Clustering of Applications with Noise) on this dataset.
We made 20 clusters with KMeans based on customer location. These clusters help the bank target specific locations for promotions, through ads or new offers and policies, focusing on the locations where most transactions, or the largest transaction amounts, occurred.
The DBSCAN results are not good, as its silhouette score came out negative.
DBSCAN's silhouette score is negative because DBSCAN does not perform well on this kind of dense dataset.
All of this is enough to justify choosing the KMeans algorithm over DBSCAN.