Implementation of Integrated Approach of
K-means Clustering Algorithm for
Prediction Analysis
For Partial Fulfillment for the Degree to be
awarded by
Gujarat Technological University
Presented by
Manisha Goyal(160130702006)
Carried Out at
Government Engineering College, Gandhinagar
Under the Supervision of
Prof. M.B. Chaudhari
Dissertation Phase-II Presentation
Layout
• Motivation and Objective of research work
• Theoretical Background
• Literature Review and Comparative Study
• Problem Identification
• Existing v/s Proposed Methodology
• DP-1 and MSR Comments with Solutions
• Implementation of Proposed Work
• Results Analysis
• Conclusion
• Bibliography
• Paper Publication Certificate
Motivation of Work
K-means is a very old concept that keeps growing in popularity because of its simplicity and linear time
complexity. However, it has two main disadvantages: 1) it is highly sensitive to outliers, and 2) it is highly
dependent on initialization parameters (the random choice of k and the positions of the initial cluster
centroids). Many improved variants of the K-means method are detailed in the literature, but it remains an
open field of research because of its extensive applications in medicine, business and marketing,
social-media sentiment analysis, and more.
Overlapping K-means is an extended version of K-means. It is a fairly new concept and is widely used in
fields where overlapping clusters are required. As an extension of K-means, it inherits the same need for
improvement, and there is considerable scope for improving its accuracy and dependability.
Objective
“The goal of this research work is to improve the accuracy of existing overlapping K-means
Clustering by removing its dependency on initialization parameters (random choice of k clusters
and placement of initial cluster centroids) and to evaluate the results using different measures for
different applications”.
To achieve this objective, the proposed algorithm performs the following steps:
1) Preprocess the raw dataset.
2) Calculate the optimum value of K (derived entirely from the dataset, NOT taken as input from the user).
3) Find the positions of the initial centroids (using the proposed Harmonic Means method, not as random
input from the user), and then apply OKM using the above results.
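The thesis's pipeline was written in R; as a stand-in for step 1 only, the sketch below shows a minimal preprocessing pass in Python, assuming purely numeric attributes (the exact preprocessing used in the thesis is not spelled out here):

```python
import numpy as np

def preprocess(raw):
    """Step 1 sketch: drop rows with missing values, then min-max scale each attribute."""
    X = np.asarray(raw, dtype=float)
    X = X[~np.isnan(X).any(axis=1)]          # remove incomplete observations
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (X - lo) / span

# Toy records (one has a missing value and is dropped):
data = [[170, 65], [160, float("nan")], [180, 90], [150, 50]]
X = preprocess(data)
print(X.round(3))  # every attribute scaled into [0, 1]
```

The scaled output then feeds the K-selection and centroid-initialization steps.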
Chapter 1
Theoretical Background
Clustering
• Objective: to find natural groupings among objects.
• It is an unsupervised learning problem, which deals with finding structure in a collection of
unlabelled data.
• Organize data so that objects in the same cluster are similar to each other and dissimilar to
objects in other clusters.
Clustering categories based on generated clusters:
1. Exclusive (Non-overlapping) Clustering
2. Overlapping Clustering
Why Overlapping Clustering?
Most existing clustering methods assume that each data observation belongs to one and only one
cluster, leading to k disjoint clusters explaining the data. However, in many applications the data
being modeled has a much richer and more complex hidden representation in which observations
actually belong to multiple clusters.
• In social network analysis, community-extraction algorithms need to detect overlapping clusters,
since an actor can belong to multiple communities.
• In text clustering, learning methods should be able to place a document that discusses more than
one topic into several groups.
• In the medical domain, various diseases share common overlapping symptoms; fever, for example,
is a common symptom of typhoid, malaria, viral infections, and many others.
K-means clustering
• Partitioning method for clustering.
• Objective: takes the input parameter, k, and partitions a set of n objects into k clusters.
• Dissimilarity measures of K-means are:
Euclidean Distance
Manhattan Distance
• Cluster mean is used to update the centroid of that cluster.
• The aim of K-means is to minimize the objective function, or square-error criterion, defined as:

E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2

where E is the sum of the square error over all objects in the dataset, p is the point in space representing a
given object, and m_i is the mean of cluster C_i (both p and m_i are multidimensional).
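As a concrete check of the criterion, the sketch below evaluates E for a toy two-cluster assignment (a generic illustration, with p and m_i as defined above):

```python
import numpy as np

def sse(X, labels, centroids):
    """Square-error criterion: E = sum_i sum_{p in C_i} ||p - m_i||^2."""
    return sum(np.sum((X[labels == i] - m) ** 2)
               for i, m in enumerate(centroids))

# Four points in two tight clusters, each point 1 unit from its cluster mean.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(sse(X, labels, centroids))  # 4.0
```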
How does K-means work?
1. Initialization:
• Randomly choose cluster centroids for K = 2.
2. Cluster Assignment:
• Compute the distance between the data points and the cluster centroid by using dissimilarity
measures.
• Depending upon the minimum distance, data points are divided into 2 clusters.
3. Move Centroid:
• Compute the mean of blue dots and reposition blue centroid to this mean
• Compute the mean of orange dots and reposition orange centroid to this mean
4. Optimization and Convergence:
• Repeat the previous two steps iteratively until the cluster centroids stop changing position.
• When further computation no longer changes the clusters, the algorithm has converged, yielding
the final clustering.
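The four steps above can be sketched as one compact loop. This is a generic illustration of the algorithm, not the thesis's R implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Cluster assignment: each point joins its nearest centroid.
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # 3. Move centroid: recompute each centroid as its cluster mean.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # 4. Convergence: stop once centroids no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.array([[0, 0], [0, 1], [9, 9], [9, 10]], dtype=float)
labels, cents = kmeans(X, 2)  # the two tight pairs end up in separate clusters
```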
Advantages and Disadvantages
Advantages:
• Easy to implement and robust.
• Relatively scalable and efficient in processing large data sets with linear time complexity.
• Produce tighter clusters than hierarchical clustering.
Disadvantages:
• Applicable only when the mean of a cluster is defined.
• Cannot be applied to categorical attributes.
• Sensitive to the selection of the number of clusters k and the initial cluster centers.
• Not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
• Sensitive to noise and outlier data points.
Overlapping K-means (OKM)
• The OKM method extends the objective function used in K-means to consider the possibility of
overlapping clusters.
• The K-means algorithm aims at clustering X = \{x_1, \dots, x_n\} into k clusters by minimizing the
following objective function:

Q(\pi) = \sum_{j=1}^{k} \sum_{x_i \in \pi_j} \lVert x_i - z_j \rVert^2

where x_i is a v-dimensional observation, \pi = \{\pi_1, \dots, \pi_k\} is the set of k clusters
(\pi_i \cap \pi_j = \emptyset for i \neq j), and Z = \{z_1, \dots, z_k\} is the set of cluster centroids.
Cont…
• The OKM approach relaxes the objective function of K-means to allow overlapping by removing
the constraint 𝜋𝑖 ∩ 𝜋𝑗 = ∅, for 𝑖 ≠ 𝑗.
• The objective function of OKM is defined as:

Q'(\pi) = \sum_{i=1}^{n} \lVert x_i - \phi(x_i) \rVert^2

• Here \phi(x_i) is the representation of x_i, also called its 'image' or 'barycenter of clusters',
defined as a combination of the centroids z_j of the clusters \pi_j that x_i belongs to, computed as:

\phi(x_i) = \frac{\sum_{z_j \in \pi(x_i)} z_j}{|\pi(x_i)|}
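The image \phi(x_i) is simply the average of the centroids of all clusters containing x_i; a one-step sketch (illustrative, not the thesis's code):

```python
import numpy as np

def image(assigned_centroids):
    """phi(x_i): the mean of the centroids of every cluster x_i belongs to."""
    Z = np.asarray(assigned_centroids, dtype=float)
    return Z.mean(axis=0)

# A point x_i assigned to two overlapping clusters with centroids z1 and z2:
z1, z2 = [0.0, 0.0], [4.0, 2.0]
print(image([z1, z2]))  # [2. 1.]
```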
Cont…
Here, the centroid 𝑧𝑗 ∈ 𝜋(𝑥𝑖) , where 𝜋(𝑥𝑖) is the list of all clusters that 𝑥𝑖 belongs to. The
centroid 𝑧𝑗 is updated using the following equation:
z_j^* = \frac{1}{\sum_{x_i \in \pi_j} \frac{1}{\delta_i^2}}
        \sum_{x_i \in \pi_j} \frac{1}{\delta_i^2}
        \Big( \delta_i \, x_i - \sum_{z_h \in \pi(x_i) \setminus \{z_j\}} z_h \Big),
Where 𝛿𝑖 is the total number of clusters that 𝑥𝑖 belongs to (in this case 𝛿𝑖 = |𝜋 𝑥𝑖 |).
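The update rule above can be transcribed directly into code. The sketch below assumes cluster memberships are given as a mapping from point index to the list of its clusters; these data structures are illustrative, not from the thesis:

```python
import numpy as np

def update_centroid(j, X, members, assignments, Z):
    """One OKM update of centroid z_j per the equation above.

    j:           index of the cluster being updated
    X:           data array, one observation per row
    members:     indices of the points x_i in cluster pi_j
    assignments: dict mapping point index -> list of cluster indices it belongs to
    Z:           current centroid array
    """
    num = np.zeros(Z.shape[1])
    den = 0.0
    for i in members:
        delta = len(assignments[i])                      # delta_i = |pi(x_i)|
        # Sum of the other centroids x_i belongs to (excluding z_j itself):
        others = sum((Z[h] for h in assignments[i] if h != j),
                     np.zeros(Z.shape[1]))
        num += (delta * X[i] - others) / delta**2
        den += 1.0 / delta**2
    return num / den

# Sanity check: with no overlap (delta_i = 1), the rule reduces to the plain mean.
Xd = np.array([[0.0, 0.0], [2.0, 2.0]])
Zd = np.array([[5.0, 5.0]])
z_new = update_centroid(0, Xd, [0, 1], {0: [0], 1: [0]}, Zd)
print(z_new)  # [1. 1.]
```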
Evaluation metrics
1. Sum of Square Error (SSE)
2. between_ss/total_ss Ratio
3. Number of Iterations
4. F-Measures and FBCubed Measures
5. Rand Index
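As an example of these measures, the Rand index counts the fraction of point pairs on which two clusterings agree (same cluster in both, or different in both). A minimal sketch for hard, non-overlapping labelings:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # pair counted if both clusterings agree
    return agree / len(pairs)

# Identical partitions (up to relabeling) agree on every pair:
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```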
Chapter 2
Literature Survey
Literature Review and Comparative Study
| Title | Publication |
|---|---|
| 1. Applications of Partition based Clustering Algorithms: A Survey | IEEE 2013 |
| 2. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means | IEEE 2015 |
| 3. Disease Prediction using Hybrid K-means and Support Vector Machine | IEEE 2016 |
| 4. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters | Springer 2017 |
| 5. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data | Elsevier 2017 |
| 6. An Improved Overlapping k-Means Clustering Method for Medical Applications | Elsevier 2017 |
1. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means
• Techniques used: K-means clustering algorithm and BIRCH.
• Strengths: better performance than K-Means and K-Medoid clustering; handles large datasets more effectively.
• Weaknesses: results vary with k values; computation time can be reduced further.

2. Disease Prediction using Hybrid K-means and Support Vector Machine
• Techniques used: hybrid K-means algorithm, which uses silhouette values to find the k value and initial centroids, combined with a Support Vector Machine.
• Strengths: K-means achieved an accuracy of 82%, while the hybrid algorithm achieved 92% on the same dataset.
• Weaknesses: accuracy can be further improved by using an improved K-means algorithm and SVM.

3. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters
• Techniques used: sorted (merge or quick sort) K-means, which determines the initial centroids.
• Strengths: effectively and efficiently forms stable clusters with fewer iterations.
• Weaknesses: space and time complexity for big data can be improved further.
4. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data
• Techniques used: density-based version of K-means.
• Strengths: better overall performance than the compared algorithms.
• Weaknesses: does not deal with outliers.

5. An Improved Overlapping k-Means Clustering Method for Medical Applications
• Techniques used: k-harmonic means and overlapping k-means algorithms (KHMOKM).
• Strengths: better performance than OKM; better minimization of the objective function.
• Weaknesses: relies on Euclidean distance; depends on the initial selection of the number of clusters k.
Chapter 3
Problem Identification
The following problems were identified during the literature review:
1. Computation time of an integrated approach is a major issue, because integrating two approaches increases
time complexity, which is not acceptable. [paper 2]
2. Some algorithms do not deal properly with outliers, which decreases overall accuracy on large
datasets. [paper 5]
3. Some algorithms do not work well with large datasets because of space-complexity issues. [paper 3]
4. The integrated KHM-OKM method increases the complexity of the algorithm. [paper 6]
5. Most algorithms rely on Euclidean distance to find the closest centroid, which is not
suitable for all types of datasets.
Chapter 4
Existing v/s Proposed Methodology
Chapter 5
DP-1 and MSR Comments
| Sr. No. | DP-1 Comments by External | Status |
|---|---|---|
| 1 | Good literature review. | |
| 2 | Detailed algorithm needs to be prepared. | Done in MSR |
| 3 | Evaluate the complexity of the proposed approach. | Done in MSR |

| Sr. No. | MSR Comments by External | Status |
|---|---|---|
| 1 | Proposed algorithm needs to be implemented with a sufficiently large dataset. | Done |
| 2 | Implementation and results of the work should be displayed during DP-2. | Done |
MSR Comments with solution
Comment 1: Proposed algorithm needs to be implemented with a sufficiently large dataset.
Solution: the following two large datasets were taken (Lung Cancer and Diabetes Disease):

| Sr. No. | Dataset Name | Size (rows × attributes) |
|---|---|---|
| 1 | Lung Cancer Dataset | 1000 × 25 |
| 2 | Diabetes Disease Dataset | 768 × 9 |

Comment 2: Implementation and results of the work should be displayed during DP-2.
Solution: the implementation of the proposed algorithm and its results are described in the succeeding
sections.
Lung Cancer Dataset and Diabetes Disease Dataset (dataset screenshots shown as figures).
Chapter 6
Implementation of Proposed Work
1. Coding and analysis done in RStudio.
2. Implementation of the integrated OKM in the Weka4OC toolbox.
3. Recording and analysis of results in Excel 2013.
Step 1: Import the dataset into RStudio and analyse it (Lung Cancer dataset).
Step 2: Apply methods to find the value of K.
Step 3: Apply the proposed Harmonic Mean method to find the initial centroids.
Cont…(step 3)
Chapter 7
Result Analysis
(Original OKM v/s Integrated OKM)
1. Existing methodology (original OKM):
• Random user inputs; run these random inputs in the Weka4OC tool.
2. Proposed integrated OKM methodology:
• Step 1: Determine an appropriate k value through the algorithm.
• Step 2: Calculate the initial centroid positions through the proposed harmonic means method.
• Step 3: Run the inputs generated by the above algorithm in Weka4OC.
Proposed OKM Methodology
Step 1: Determine appropriate K value through algorithm.

Ratio of between_SS/total_SS:

| Method to find K | Lung Cancer Dataset | Diabetes Dataset |
|---|---|---|
| K = 3, Elbow method | 50.70% | 74.90% |
| K = 2, Silhouette method | 39.30% | 55.70% |
| K = 1, Gap statistic method | 0.00% | 0.00% |
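The between_SS/total_SS ratio in this table is the share of total variance explained by the clustering (the quantity R's kmeans reports as betweenss/totss). A small sketch of the computation, on an illustrative toy dataset rather than the thesis's data:

```python
import numpy as np

def betweenss_totalss(X, labels, centroids):
    """between_SS / total_SS: fraction of total variance explained by the clusters."""
    grand = X.mean(axis=0)
    total_ss = np.sum((X - grand) ** 2)                      # total scatter
    sizes = np.array([np.sum(labels == j) for j in range(len(centroids))])
    between_ss = np.sum(sizes[:, None] * (centroids - grand) ** 2)
    return between_ss / total_ss

# Two well-separated clusters: almost all variance is between-cluster.
X = np.array([[0, 0], [0, 2], [10, 0], [10, 2]], dtype=float)
labels = np.array([0, 0, 1, 1])
cents = np.array([[0.0, 1.0], [10.0, 1.0]])
ratio = betweenss_totalss(X, labels, cents)
print(round(ratio, 3))  # 0.962
```

A higher ratio for a candidate K indicates a tighter, better-separated clustering, which is how the table above compares the elbow, silhouette, and gap statistic choices.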
Step 2: Calculate the initial centroid positions through the proposed Harmonic
Means method (after pre-processing, if required).
Step 3: Input the best value of K and the initial centroid positions calculated in the above
step into Weka4OC.
Comparison of Results:
Lung Cancer Dataset
'8 different scenarios where 4 users enter random values in 8 different ways' vs. the integrated OKM algorithm:

| User | K | Initial centroid position | Initial centroid values | Precision | Recall | F-measure | Rand Index | BCubed Precision | # Iterations |
|---|---|---|---|---|---|---|---|---|---|
| User 1_R | 2 | Randomly generated | NA | 0.143 | 0.904 | 0.247 | 0.244 | 0.107 | 11 |
| User 1_U | 2 | Random input value by user | 56, 789 | 0.141 | 0.911 | 0.243 | 0.223 | 0.114 | 8 |
| User 2_R | 3 | Randomly generated | NA | 0.159 | 0.777 | 0.264 | 0.352 | 0.116 | 7 |
| User 2_U | 3 | Random input value by user | 45, 578, 899 | 0.142 | 0.798 | 0.242 | 0.253 | 0.116 | 9 |
| User 3_R | 4 | Randomly generated | NA | 0.169 | 0.852 | 0.283 | 0.293 | 0.1 | 9 |
| User 3_U | 4 | Random input value by user | 23, 456, 678, 890 | 0.169 | 0.801 | 0.278 | 0.322 | 0.113 | 11 |
| User 4_R | 5 | Randomly generated | NA | 0.188 | 0.853 | 0.309 | 0.325 | 0.101 | 9 |
| User 4_U | 5 | Random input value by user | 23, 456, 658, 123, 897 | 0.196 | 0.631 | 0.3 | 0.479 | 0.173 | 8 |
| Average (random inputs) | | | | 0.163375 | 0.815875 | 0.27075 | 0.311375 | 0.1175 | 9 |
| Integrated OKM algorithm | 3 | Harmonic Mean method | 1, 329, 685 | 0.167 | 0.92 | 0.283 | 0.438 | 0.175 | 3 |
Graphs for Comparison of Results: Lung Cancer Data Set (bar charts 1-3, shown as figures; values as in the table above).
Comparison of Results:
Diabetes Disease Dataset
'8 different scenarios where 4 users enter random values in 8 different ways' vs. the integrated OKM algorithm:

| User | K | Initial centroid position | Initial centroid values | Precision | Recall | F-measure | Rand Index | BCubed Precision | # Iterations |
|---|---|---|---|---|---|---|---|---|---|
| User 1_R | 2 | Randomly generated | NA | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 19 |
| User 1_U | 2 | Random input value by user | 54, 678 | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 18 |
| User 2_R | 3 | Randomly generated | NA | 0.11 | 0.742 | 0.192 | 0.248 | 0.099 | 14 |
| User 2_U | 3 | Random input value by user | 16, 390, 481 | 0.116 | 0.815 | 0.203 | 0.229 | 0.095 | 25 |
| User 3_R | 4 | Randomly generated | NA | 0.115 | 0.656 | 0.196 | 0.352 | 0.114 | 25 |
| User 3_U | 4 | Random input value by user | 4, 317, 590, 712 | 0.111 | 0.782 | 0.194 | 0.217 | 0.098 | 29 |
| User 4_R | 5 | Randomly generated | NA | 0.11 | 0.61 | 0.187 | 0.361 | 0.108 | 23 |
| User 4_U | 5 | Random input value by user | 5, 18, 386, 495, 600 | 0.113 | 0.721 | 0.195 | 0.283 | 0.105 | 38 |
| Average (random inputs) | | | | 0.113625 | 0.7645 | 0.197375 | 0.254 | 0.105125 | 23.875 |
| Integrated OKM method | 3 | Harmonic Mean method | 1, 254, 524 | 0.131 | 0.802 | 0.226 | 0.337 | 0.099 | 8 |
Graphs for Comparison of Results: Diabetes Disease Data Set (bar charts of precision, recall, F-measure, Rand Index, BCubed precision, and number of iterations for each run versus the proposed OKM; values as in the table above).
Conclusion
The proposed work showed positive results: (i) it removes the method's dependency on random input
parameters, and (ii) it normalizes the outliers.
From the above results we find that, barring one or two accuracy measures, the performance of the
proposed integrated OKM tool is better than the usual OKM method.
We also observe that the integrated OKM reduces the time complexity in both cases, as the number
of iterations is greatly reduced.
As far as future work is concerned, this thesis provides a base for further research on effective,
improved clustering that can create a long-lasting positive impact on the medical field and many
other fields.
Bibliography
PAPERS:
1. Argenis A. Aroche-Villarruel, J. A. Carrasco-Ochoa, José Fco. Martínez-Trinidad, J. Arturo Olvera-López, and Airel Pérez-Suárez, "Study of Overlapping Clustering Algorithms Based on K-means through FBcubed Metric", Springer International Publishing Switzerland, 2014.
2. A. Dharmarajan, T. Velmurugan, "Applications of Partition based Clustering Algorithms: A Survey", IEEE, 2013.
3. Jaskaranjit Kaur and Harpreet Singh, "Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means", IEEE, 2015.
4. Sandeep Kaur and Sheetal Kalra, "Disease Prediction using Hybrid K-means and Support Vector Machine", IEEE, 2016.
5. Preeti Arora, Deepali Virmani, Himanshu Jindal and Mritunjaya Sharma, "Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters", Proceedings of International Conference on Communication and Networks, Springer, 2017.
6. N. Nidheesh, K. A. Abdul Nazeer, P. M. Ameer, "An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data", Computers in Biology and Medicine, Elsevier, 2017.
7. Sina Khanmohammadi, Naiier Adibeig, Samaneh Shanehbandy, "An Improved Overlapping k-Means Clustering Method for Medical Applications", Expert Systems With Applications, Elsevier, 2016.
8. Hailong Chen, Chunli Liu, "Research and Application of Cluster Analysis Algorithm", 2nd International Conference on Measurement, Information and Control, IEEE, 2013.
9. Shraddha Shukla and Naganna S., "A Review on K-means Data Clustering Approach", International Journal of Information & Computation Technology, 2014.
10. L. V. Bijuraj, "Clustering and its applications", Proceedings of National Conference on New Horizons in IT - NCNHIT, 2013.
11. Pankaj Saxena and Sushma Lehri, "Analysis of various clustering algorithms of data mining on Health informatics", International Journal of Computer & Communication Technology, 2013.
12. K. Rajalakshmi, S. S. Dhenakaran, N. Roobini, "Comparative Analysis of K-Means Algorithm in Disease Prediction", International Journal of Science, Engineering and Technology Research (IJSETR), July 2015.
13. Amit Saxena, Mukesh Prasad, Akshansh Gupta, Neha Bharill, Om Prakash Patel, Aruna Tiwari, Meng Joo Er, Weiping Ding, Chin-Teng Lin, "A Review of Clustering Techniques and Developments", Elsevier, 2017.
14. Guillaume Cleuziou, "An extended version of the k-means method for overlapping clustering", IEEE, 2008.
WEBSITES
1. https://en.wikipedia.org/wiki/Cluster_analysis#Applications
2. http://stp.lingfil.uu.se/~santinim/ml/2016/Lect_10/10c_UnsupervisedMethods.pdf
3. https://en.wikipedia.org/wiki/K-means_clustering
4. https://www.jstatsoft.org/article/view/v050i10
5. https://en.wikipedia.org/wiki/Silhouette_(clustering)
6. https://en.wikipedia.org/wiki/Correlation_clustering
7. http://www.francescobonchi.com/CCtuto_kdd14.pdf
Paper Publication Certificate
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 

Dernier (20)

Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 

Master's Thesis Presentation

  • 1. Implementation of Integrated Approach of K-means Clustering Algorithm for Prediction Analysis. Dissertation Phase-II Presentation, in partial fulfillment of the degree to be awarded by Gujarat Technological University. Presented by Manisha Goyal (160130702006). Carried out at Government Engineering College, Gandhinagar, under the supervision of Prof. M.B. Chaudhari.
  • 2. Layout  Motivation and Objective of research work  Theoretical Background  Literature Review and Comparative Study  Problem Identification  Existing v/s Proposed Methodology  DP-1 and MSR Comments with Solutions  Implementation of Proposed Work  Results Analysis  Conclusion  Bibliography  Paper Publication Certificate
  • 3. Motivation of Work K-means is a long-established method whose popularity keeps growing because of its simplicity and linear time complexity. However, it has two main disadvantages: 1) it is highly sensitive to outliers, and 2) it is highly dependent on initialization parameters (the random choice of k and the positions of the initial cluster centroids). Many improved variants of K-means are described in the literature, but it remains an open field of research because of its extensive applications in medicine, business and marketing, social-media sentiment analysis, etc. Overlapping K-means is an extended version of K-means; it is a fairly new concept and is widely used in fields where overlapping clusters are required. As an extension of K-means, it inherits the same need for improvement, and there is considerable scope for improving its accuracy and dependability.
  • 4. Objective “The goal of this research work is to improve the accuracy of the existing overlapping K-means clustering by removing its dependency on initialization parameters (random choice of k and placement of initial cluster centroids) and to evaluate the results using different measures for different applications”. To achieve this objective, the proposed algorithm performs the following steps: 1) Preprocess the raw dataset. 2) Calculate the optimum value of K (derived entirely from the dataset, not taken as input from the user). 3) Find the positions of the initial centroids (using the proposed Harmonic Means method, not as random input from the user), then apply OKM with these results.
  • 6. Clustering • Objective: To find natural groupings among objects • It is an unsupervised learning problem which deals with finding a structure in a collection of unlabelled data. • Organize data so that there is high intra-cluster similarity and low inter-cluster similarity.
  • 7. Clustering category based on generated clusters 1. Exclusive (Non-overlapping) Clustering 2. Overlapping Clustering
  • 8. Why Overlapping Clustering? Most existing clustering methods assume that each data observation belongs to one and only one cluster, leading to k disjoint clusters explaining the data. However, in many applications the modeled data can have a much richer and more complex hidden representation in which observations actually belong to multiple clusters. • In social network analysis, community-extraction algorithms need to detect overlapping clusters where an actor can belong to multiple communities. • In text clustering, learning methods should be able to group documents that discuss more than one topic into several groups. • In the medical domain, various diseases share common overlapping symptoms; for example, fever is a common symptom of typhoid, malaria, viral infection and many others.
  • 9. K-means clustering • Partitioning method for clustering. • Objective: takes the input parameter k and partitions a set of n objects into k clusters. • Dissimilarity measures for K-means: Euclidean distance and Manhattan distance. • The cluster mean is used to update the centroid of that cluster. • The aim of K-means is to minimize the objective function, the square-error criterion, defined as $E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$, where E is the sum of the squared error over all objects in the data set, p is the point in space representing a given object, and $m_i$ is the mean of cluster $C_i$ (both p and $m_i$ are multidimensional).
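The square-error criterion E above can be computed directly from the data, the cluster labels, and the cluster means. The following is an illustrative sketch in Python/NumPy (the thesis itself uses RStudio, so this is not the thesis code):

```python
import numpy as np

def sse(X, labels, centroids):
    """Square-error criterion E: sum of squared Euclidean distances of
    each point to the mean of the cluster it is assigned to."""
    return sum(
        float(np.sum((X[labels == i] - centroids[i]) ** 2))
        for i in range(len(centroids))
    )

# Toy example: two tight, well-separated clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
print(sse(X, labels, centroids))  # each cluster contributes 0.5 -> 1.0
```

Lower E means tighter clusters; K-means iterations never increase this quantity.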
  • 10. How does K-means Work? 1. Initialization: • Randomly choose cluster centroids for K = 2
  • 11. 2. Cluster Assignment: • Compute the distance between the data points and the cluster centroids using the dissimilarity measure. • Based on the minimum distance, the data points are divided into 2 clusters.
  • 12. 3. Move Centroid: • Compute the mean of blue dots and reposition blue centroid to this mean • Compute the mean of orange dots and reposition orange centroid to this mean
  • 13. 4. Optimization and Convergence: • Repeat the previous two steps iteratively until the cluster centroids stop changing position. • When the clusters no longer change with further computation, the algorithm has converged. • Below is the final clustering.
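The four steps above (initialization, cluster assignment, centroid move, convergence check) can be sketched as a minimal NumPy implementation. This is an illustration of the standard algorithm, not the thesis's R code:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: random initialization, assignment to the nearest
    centroid, centroid update, repeated until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment: nearest centroid by Euclidean distance.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move centroid: mean of the points assigned to it.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # convergence
            break
        centroids = new
    return labels, centroids

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
```

On this toy data the two nearby pairs of points end up in separate clusters regardless of which points are drawn as initial centroids.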
  • 14. Advantages and Disadvantages Advantages: • Easy to implement and robust. • Relatively scalable and efficient in processing large data sets, with linear time complexity. • Produces tighter clusters than hierarchical clustering. Disadvantages: • Applicable only when the mean of a cluster is defined. • Cannot be applied to categorical attributes. • Sensitive to the selection of the number of clusters k and the initial cluster centers. • Not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes. • Sensitive to noise and outlier data points.
  • 15. Overlapping K-means (OKM) • The OKM method extends the objective function used in K-means to consider the possibility of overlapping clusters. • The K-means algorithm aims at clustering $X = \{x_1, \ldots, x_n\}$ into k clusters by minimizing the objective function $Q(\pi) = \sum_{j=1}^{k} \sum_{x_i \in \pi_j} \lVert x_i - z_j \rVert^2$, where $x_i$ is a v-dimensional observation, $\pi = \{\pi_1, \ldots, \pi_k\}$ is the set of k clusters (with $\pi_i \cap \pi_j = \emptyset$ for $i \neq j$), and $Z = \{z_1, \ldots, z_k\}$ is the set of cluster centroids.
  • 16. Cont… • The OKM approach relaxes the objective function of K-means to allow overlap by removing the constraint $\pi_i \cap \pi_j = \emptyset$ for $i \neq j$. • The objective function of OKM is defined as $Q'(\pi) = \sum_{i=1}^{n} \lVert x_i - \phi(x_i) \rVert^2$. • Here $\phi(x_i)$ is the representation of $x_i$, also called its ‘image’ or ‘barycenter of clusters’, defined as the combination of the centroids $z_j$ of the clusters $\pi_j$ to which $x_i$ belongs, computed as $\phi(x_i) = \frac{1}{|\pi(x_i)|} \sum_{z_j \in \pi(x_i)} z_j$.
  • 17. Cont… Here the centroid $z_j \in \pi(x_i)$, where $\pi(x_i)$ is the list of all clusters that $x_i$ belongs to. The centroid $z_j$ is updated using the following equation: $z_j^{*} = \frac{1}{\sum_{x_i \in \pi_j} 1/\delta_i^2} \sum_{x_i \in \pi_j} \frac{1}{\delta_i^2} \big( \delta_i \, x_i - \sum_{z_l \in \pi(x_i) \setminus \{z_j\}} z_l \big)$, where $\delta_i$ is the total number of clusters that $x_i$ belongs to (in this case $\delta_i = |\pi(x_i)|$).
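The key difference from K-means is the assignment step: each point may join several clusters, and its error is measured against the image φ(x), the average of the centroids it belongs to. The slides do not spell out the assignment rule, so the sketch below uses a common greedy heuristic (add centroids in order of closeness while the error keeps dropping); treat it as an illustration, not the thesis's exact procedure:

```python
import numpy as np

def image(x, Z, assigned):
    """phi(x): average of the centroids of the clusters x belongs to."""
    return Z[assigned].mean(axis=0)

def okm_assign(x, Z):
    """Greedy multi-assignment: starting from the nearest centroid, keep
    adding the next-nearest centroid while ||x - phi(x)||^2 decreases."""
    order = np.argsort(np.linalg.norm(Z - x, axis=1))
    assigned = [order[0]]
    err = np.sum((x - image(x, Z, assigned)) ** 2)
    for j in order[1:]:
        trial = assigned + [j]
        e = np.sum((x - image(x, Z, trial)) ** 2)
        if e < err:
            assigned, err = trial, e
        else:
            break
    return sorted(int(j) for j in assigned)

Z = np.array([[0.0, 0.0], [4.0, 0.0], [20.0, 0.0]])
print(okm_assign(np.array([2.0, 0.0]), Z))   # midway point joins two clusters
print(okm_assign(np.array([19.0, 0.0]), Z))  # clearly inside one cluster
```

A point halfway between two centroids is better explained by their average than by either alone, so it is assigned to both; a point close to a single centroid stays in one cluster.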
  • 18. Evaluation metrics 1. Sum of Square Error (SSE) 2. between_ss/total_ss Ratio 3. Number of Iterations 4. F-Measures and FBCubed Measures 5. Rand Index
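Two of the pair-counting metrics listed above, the Rand index and the pair-based F-measure, can be computed from scratch. A small self-contained sketch (standard definitions, not tied to the thesis's tooling):

```python
from itertools import combinations

def pair_counts(labels_true, labels_pred):
    """Count point pairs on which the two labelings agree or disagree."""
    a = b = c = d = 0
    for (t1, p1), (t2, p2) in combinations(zip(labels_true, labels_pred), 2):
        same_t, same_p = t1 == t2, p1 == p2
        if same_t and same_p:
            a += 1          # together in both labelings
        elif same_t:
            c += 1          # together in truth, split in prediction
        elif same_p:
            b += 1          # split in truth, together in prediction
        else:
            d += 1          # apart in both labelings
    return a, b, c, d

def rand_index(lt, lp):
    a, b, c, d = pair_counts(lt, lp)
    return (a + d) / (a + b + c + d)

def f_measure(lt, lp):
    a, b, c, _ = pair_counts(lt, lp)
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

true = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(rand_index(true, pred))  # 5/6: one true pair was split
```

BCubed precision follows the same idea but averages per point rather than per pair.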
  • 20. Literature Review and Comparative Study (Title, Publication): 1. Applications of Partition based Clustering Algorithms: A Survey (IEEE 2013) 2. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means (IEEE 2015) 3. Disease Prediction using Hybrid K-means and Support Vector Machine (IEEE 2016) 4. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters (Springer 2017) 5. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data (Elsevier 2017) 6. An Improved Overlapping k-Means Clustering Method for Medical Applications (Elsevier 2017)
  • 21. Comparative Study (Title, Techniques used, Strength, Weakness):
1. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means. Techniques used: K-means clustering algorithm and BIRCH. Strength: better performance than K-Means and K-Medoid clustering; handles large datasets more effectively. Weakness: results vary with k values; computation time can be reduced further.
2. Disease Prediction using Hybrid K-means and Support Vector Machine. Techniques used: a hybrid K-means algorithm that uses silhouette values to find k and the initial centroids, plus a Support Vector Machine. Strength: K-means achieved 82% accuracy, and the hybrid algorithm achieved 92% accuracy on the same dataset. Weakness: accuracy can be improved further by using an improved K-means algorithm and SVM.
3. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters. Techniques used: sorted (merge or quick sort) K-Means, which determines the initial centroids. Strength: effectively and efficiently forms stable clusters with fewer iterations. Weakness: space and time complexity for big data can be improved further.
  • 22. Comparative Study, continued (Title, Techniques used, Strength, Weakness):
4. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data. Techniques used: a density-based version of K-Means. Strength: better overall performance than the other compared algorithms. Weakness: does not deal with outliers.
5. An Improved Overlapping k-Means Clustering Method for Medical Applications. Techniques used: combined k-harmonic means and overlapping k-means algorithms (KHMOKM). Strength: better performance than OKM; better minimization of the objective function. Weakness: relies on the Euclidean distance; depends on the initial selection of the number of clusters k.
  • 23. Chapter 3 Problem Identification Some of the problems identified during the literature review are as follows. 1. Computation time of an integrated approach is a big issue, because integrating two approaches increases the time complexity, which is not acceptable. [paper 2] 2. Some algorithms do not deal properly with outliers, which decreases the overall accuracy on large datasets. [paper 5] 3. Some algorithms do not work well with large datasets because of space-complexity issues. [paper 3] 4. The integrated KHM-OKM method increases the complexity of the algorithm. [paper 6] 5. Most algorithms rely on Euclidean distance to find the closest centroids, which is not suitable for all types of datasets.
  • 24. Chapter 4 Existing v/s Proposed Methodology
  • 25. Chapter 5 DP-1 and MSR Comments. DP-1 comments by external: 1. Good literature review. 2. A detailed algorithm needs to be prepared (done in MSR). 3. Evaluate the complexity of the proposed approach (done in MSR). MSR comments by external: 1. The proposed algorithm needs to be implemented with a sufficiently large dataset (done). 2. Implementation and results of the work should be displayed during DP-2 (done).
  • 26. MSR Comments with solutions. Comment 1: The proposed algorithm needs to be implemented with a sufficiently large dataset. Solution: we have taken the following two large datasets: 1. Lung Cancer Dataset (1000 × 25); 2. Diabetes Disease Dataset (768 × 9). Comment 2: Implementation and results of the work should be displayed during DP-2. Solution: the implementation of the proposed algorithm and its results are described in the succeeding sections.
  • 29. Chapter 6 Implementation of Proposed Work 1. Coding and analysis are done in RStudio. 2. The integrated OKM is implemented in the Weka4OC toolbox. 3. Results are recorded and analyzed in Excel 2013.
  • 30. Step 1: Import the dataset into RStudio and analyze it (Lung Cancer Dataset).
  • 31. Step 2: Apply methods to find the K value.
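The thesis applies the Elbow, Silhouette, and Gap statistic methods in R to choose K. As an illustration of the Elbow idea only, the sketch below (Python/NumPy, not the thesis code) runs a basic K-means for a range of k values and tracks the total within-cluster sum of squares; K is read off where the curve flattens:

```python
import numpy as np

def within_ss(X, k, seed=0, n_iter=50):
    """Total within-cluster sum of squares after one basic K-means run."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else C[j] for j in range(k)])
    labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
    return float(sum(np.sum((X[labels == j] - C[j]) ** 2) for j in range(k)))

def best_within_ss(X, k, restarts=25):
    """Best (lowest) within-SS over several random initializations."""
    return min(within_ss(X, k, seed=s) for s in range(restarts))

# Three well-separated blobs: the curve should flatten after k = 3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
curve = {k: best_within_ss(X, k) for k in range(1, 6)}
for k, w in curve.items():
    print(k, round(w, 2))
```

The multiple restarts guard against poor random initializations, which is exactly the sensitivity the thesis aims to remove.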
  • 32. Step 3: Apply Proposed Harmonic Mean Method to find Initial Centroids.
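The slides do not spell out the Proposed Harmonic Means method, so the following is purely a hypothetical sketch of how harmonic means could seed centroids: sort points by distance from the origin, split into k chunks, and take the per-feature harmonic mean of each chunk. Every detail here (the sorting key, the chunking, the function name) is an assumption for illustration only, and it assumes strictly positive feature values:

```python
import numpy as np

def harmonic_mean_seeds(X, k):
    """Hypothetical centroid seeding: k chunks of the data sorted by norm,
    each seeded with the per-feature harmonic mean n / sum(1/x)."""
    order = np.argsort(np.linalg.norm(X, axis=1))
    chunks = np.array_split(X[order], k)
    return np.array([len(c) / np.sum(1.0 / c, axis=0) for c in chunks])

X = np.array([[1.0, 2.0], [2.0, 2.0], [8.0, 9.0], [9.0, 9.0]])
print(harmonic_mean_seeds(X, 2))
```

Unlike random seeding, such a rule is deterministic, which is the property the thesis's method is after.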
  • 34. Chapter 7 Result Analysis (Original OKM v/s Integrated OKM) 1. Existing methodology (original OKM): random user inputs; run these random inputs in the Weka4OC tool. 2. Proposed integrated OKM methodology: Step 1: determine the appropriate K value through the algorithm. Step 2: calculate the initial centroid positions through the proposed Harmonic Means method. Step 3: run the inputs generated by the above algorithm in Weka4OC.
  • 35. Proposed OKM Methodology. Step 1: Determine the appropriate K value through the algorithm. Ratio of between_SS/total_SS (Lung Cancer Dataset, Diabetes Dataset): K=3, Elbow method: 50.70%, 74.90%; K=2, Silhouette method: 39.30%, 55.70%; K=1, Gap statistic method: 0.00%, 0.00%. Step 2: Calculate the initial centroid positions through the Proposed Harmonic Means method (after pre-processing, if required). Step 3: Input the best value of K and the initial centroid positions from the above steps into Weka4OC.
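The between_SS/total_SS ratio used in Step 1 is the share of total variance explained by the clustering, as reported by R's kmeans(). It can be computed as total_SS minus within_SS, over total_SS; a small Python sketch for illustration:

```python
import numpy as np

def bss_tss_ratio(X, labels, centroids):
    """between_SS / total_SS: fraction of total variance explained by the
    clustering (higher is better). between_SS = total_SS - within_SS."""
    grand = X.mean(axis=0)
    total_ss = np.sum((X - grand) ** 2)
    within_ss = sum(np.sum((X[labels == j] - centroids[j]) ** 2)
                    for j in range(len(centroids)))
    return (total_ss - within_ss) / total_ss

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
print(round(bss_tss_ratio(X, labels, centroids), 4))
```

For these two tight, far-apart clusters the ratio is close to 1; the 50.70% and 74.90% figures in the table above are this same quantity for K = 3 on the two thesis datasets.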
  • 36. Comparison of Results: Lung Cancer Dataset. ‘8 different scenarios where 4 users enter random values in 8 different ways’ v/s the ‘integrated OKM algorithm’.
User | K | Initial centroid position | Centroid values | Precision | Recall | F-measure | Rand Index | BCubed Precision | # Iterations
User 1_R | 2 | Randomly generated | NA | 0.143 | 0.904 | 0.247 | 0.244 | 0.107 | 11
User 1_U | 2 | Random input value by user | 56, 789 | 0.141 | 0.911 | 0.243 | 0.223 | 0.114 | 8
User 2_R | 3 | Randomly generated | NA | 0.159 | 0.777 | 0.264 | 0.352 | 0.116 | 7
User 2_U | 3 | Random input value by user | 45, 578, 899 | 0.142 | 0.798 | 0.242 | 0.253 | 0.116 | 9
User 3_R | 4 | Randomly generated | NA | 0.169 | 0.852 | 0.283 | 0.293 | 0.1 | 9
User 3_U | 4 | Random input value by user | 23, 456, 678, 890 | 0.169 | 0.801 | 0.278 | 0.322 | 0.113 | 11
User 4_R | 5 | Randomly generated | NA | 0.188 | 0.853 | 0.309 | 0.325 | 0.101 | 9
User 4_U | 5 | Random input value by user | 23, 456, 658, 123, 897 | 0.196 | 0.631 | 0.3 | 0.479 | 0.173 | 8
Average | | | | 0.163375 | 0.815875 | 0.27075 | 0.311375 | 0.1175 | 9
Integrated OKM algorithm | 3 | Harmonic Mean method | 1, 329, 685 | 0.167 | 0.92 | 0.283 | 0.438 | 0.175 | 3
  • 37.
  • 38. Graphs for Comparison of Results: Lung Cancer Data Set (1)
  • 39. Graphs for Comparison of Results: Lung Cancer Data Set (2)
  • 40. Graphs for Comparison of Results: Lung Cancer Data Set (3)
  • 41. Comparison of Results: Diabetes Disease Dataset. ‘8 different scenarios where 4 users enter random values in 8 different ways’ v/s the ‘integrated OKM algorithm’.
User | K | Initial centroid position | Centroid values | Precision | Recall | F-measure | Rand Index | BCubed Precision | # Iterations
User 1_R | 2 | Randomly generated | NA | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 19
User 1_U | 2 | Random input value by user | 54, 678 | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 18
User 2_R | 3 | Randomly generated | NA | 0.11 | 0.742 | 0.192 | 0.248 | 0.099 | 14
User 2_U | 3 | Random input value by user | 16, 390, 481 | 0.116 | 0.815 | 0.203 | 0.229 | 0.095 | 25
User 3_R | 4 | Randomly generated | NA | 0.115 | 0.656 | 0.196 | 0.352 | 0.114 | 25
User 3_U | 4 | Random input value by user | 4, 317, 590, 712 | 0.111 | 0.782 | 0.194 | 0.217 | 0.098 | 29
User 4_R | 5 | Randomly generated | NA | 0.11 | 0.61 | 0.187 | 0.361 | 0.108 | 23
User 4_U | 5 | Random input value by user | 5, 18, 386, 495, 600 | 0.113 | 0.721 | 0.195 | 0.283 | 0.105 | 38
Average | | | | 0.113625 | 0.7645 | 0.197375 | 0.254 | 0.105125 | 23.875
Integrated OKM method | 3 | Harmonic Mean method | 1, 254, 524 | 0.131 | 0.802 | 0.226 | 0.337 | 0.099 | 8
  • 42.
  • 43. Graphs for Comparison of Results: Diabetes Disease Data Set (1): bar charts of Precision and Recall for each user scenario and the Proposed OKM algorithm (values as tabulated above).
  • 44. Graphs for Comparison of Results: Diabetes Disease Data Set (2): bar charts of F-measure and Rand Index for each user scenario and the Proposed OKM algorithm.
  • 45. Graphs for Comparison of Results: Diabetes Disease Data Set (3): bar charts of BCubed Precision and number of iterations for each user scenario and the Proposed OKM algorithm.
  • 46. Conclusion The proposed work was robust enough to show positive results, as it (i) removes the method's dependency on random input parameters and (ii) normalizes the outliers. From the results above we find that, barring one or two accuracy measures, the performance of the proposed integrated OKM tool is better than that of the usual OKM method.  We can also observe that the integrated OKM helps reduce the time complexity in both cases, as the number of iterations is greatly reduced. As far as future work is concerned, this thesis provides a base for further research on effective improved clustering, which can create a long-lasting positive impact on the medical field and many others.
  • 47. Bibliography PAPERS: 1. Argenis A. Aroche-Villarruel1, J.A. Carrasco-Ochoa1, José Fco. Martínez-Trinidad1,J. Arturo Olvera-López2, and Airel Pérez-Suárez3, “Study of Overlapping Clustering Algorithms Based on Kmeans through FBcubed Metric”, Springer International Publishing Switzerland 2014 2. A.Dharmarajan, T. Velmurugan, “Applications of Partition based Clustering Algorithms: A Survey” 2013 IEEE 3. Jaskaranjit Kaur and Harpreet Singh, “Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means” 2015 IEEE 4. Sandeep Kaur and Dr. Sheetal Kalra, “Disease Prediction using Hybrid K-means and Support Vector Machine” 2016 IEEE 5. Preeti Arora, Deepali Virmani, Himanshu Jindal and Mritunjaya Sharma, “Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters”, Proceedings of International Conference on Communication and Networks, Springer 2017 6. N. Nidheesh, K.A. Abdul Nazeer, P.M. Ameer, ” An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data”, Computers in Biology and Medicine 2017 Elsevier 7. Sina Khanmohammadi, Naiier Adibeig, Samaneh Shanehbandy, “An Improved Overlapping k-Means Clustering Method for Medical Applications”, Expert Systems With Applications 2016 Elsevier 8. Hailong Chen, Chunli LiuZahid, “Research and Application of Cluster Analysis Algorithm”. 2nd International Conference on Measurement, Information and Control, 2013 IEEE 9. Shraddha Shukla and Naganna S, “A Review ON K-means DATA Clustering APPROACH” International Journal of Information & Computation Technology 2014 10. L.V. Bijuraj, “Clustering and its applications”. Proceedings of National Conference on New Horizons in IT - NCNHIT 2013
  • 48. Bibliography (cont.) 11. Pankaj Saxena and Sushma Lehri, “Analysis of various clustering algorithms of data mining on Health informatics”, International Journal of Computer & Communication Technology, 2013 12. K. Rajalakshmi, Dr. S.S. Dhenakaran, N. Roobini, “Comparative Analysis of K-Means Algorithm in Disease Prediction”, International Journal of Science, Engineering and Technology Research (IJSETR), July 2015 13. Amit Saxena, Mukesh Prasad, Akshansh Gupta, Neha Bharill, Om Prakash Patel, Aruna Tiwari, Meng Joo Er, Weiping Ding, Chin-Teng Lin, “A Review of Clustering Techniques and Developments”, Elsevier 2017 14. Guillaume Cleuziou, “An extended version of the k-means method for overlapping clustering”, IEEE 2008 WEBSITES 1. https://en.wikipedia.org/wiki/Cluster_analysis#Applications 2. http://stp.lingfil.uu.se/~santinim/ml/2016/Lect_10/10c_UnsupervisedMethods.pdf 3. https://en.wikipedia.org/wiki/K-means_clustering 4. https://www.jstatsoft.org/article/view/v050i10 5. https://en.wikipedia.org/wiki/Silhouette_(clustering) 6. https://en.wikipedia.org/wiki/Correlation_clustering 7. http://www.francescobonchi.com/CCtuto_kdd14.pdf