International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 3, May-June 2013, pp. 204-219, © IAEME
DYNAMIC APPROACH TO k-Means CLUSTERING ALGORITHM
Deepika Khurana 1 and Dr. M.P.S. Bhatia 2
1, 2 (Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, India)
ABSTRACT
The k-Means clustering algorithm is a heuristic algorithm that partitions the dataset into k clusters by minimizing the sum of squared distances within each cluster. However, it has a number of weaknesses. First, it requires prior knowledge of the number of clusters 'k'. Second, it is sensitive to initialization, which leads to different solutions on different runs. This paper presents a new approach to k-Means clustering by providing a solution to the initial selection of cluster centroids and a dynamic approach based on the silhouette validity index. Instead of running the algorithm for different values of k, the user needs to give only an initial value of k as ko, and the algorithm itself determines the right number of clusters for a given dataset. The algorithm is implemented in MATLAB R2009b and the results are compared with the original k-Means algorithm and other modified k-Means clustering algorithms. The experimental results demonstrate that the proposed scheme improves both the initial center selection and the overall computation time.
Keywords: Clustering, Data mining, Dynamic, k-Means, Silhouette validity index.
I. INTRODUCTION
Data Mining is defined as the mining of knowledge from huge amounts of data. Using data mining we can predict the nature and behaviour of any kind of data. It was recognized that information is at the heart of business operations and that decision makers could make use of the stored data to gain valuable insight into the business. DBMS gave access to the stored data, but this was only a small part of what could be gained from the data.
Analyzing the data can provide further knowledge about the business by going beyond the data explicitly stored to derive knowledge about the business. The ability to learn valuable information from data has made clustering techniques widely applied in areas such as artificial intelligence, customer relationship management, data compression, data mining, image processing, machine learning, pattern recognition, market analysis, and fraud detection. Cluster analysis of data is an important task in Knowledge Discovery and Data Mining. Clustering is the process of grouping data on the basis of similarities and dissimilarities among the data elements: it finds groups of objects such that objects in one group are similar to one another and different from the objects in other groups.
Clustering is an unsupervised algorithm that requires a parameter specifying the number of clusters k. Setting this parameter requires either detailed knowledge of the dataset or running the algorithm for different values of k to determine the correct number of clusters. However, for large and multidimensional data the process of clustering becomes time consuming, and determining the correct number of clusters becomes difficult.
The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. However, its performance has also been criticized, particularly for demanding the value of k in advance. It is evident from previous research that having to provide the number of clusters in advance does not in any way assist in the production of good quality clusters. The original k-Means also determines the initial centers randomly in each run, which leads to different solutions.
To validate the clustering results we have chosen the silhouette validity index as a validity measure. The silhouette validity index is particularly useful when seeking the number of clusters that are compact and well separated. This index is used after clustering to check the validity of the clusters produced.
This paper presents a new method for the selection of the initial k centers and a dynamic approach to k-Means clustering. An initial value of k as ko is provided by the user. The algorithm then partitions the whole space into different segments and calculates the frequency of data points in each segment. The ko highest-frequency segments are chosen as the initial ko clusters. To determine the initial centers, the algorithm calculates, for each segment, the distance of its points from the origin, sorts these distances, and chooses the coordinates corresponding to the middle value of the distances as the center for that segment. Then the cluster assignment process is done, and the silhouette validity index is calculated for the initial ko clusters. This step is then repeated for (ko + 2) and (ko - 2) clusters. The algorithm then iterates again under specified conditions and stops at the maximum value of the silhouette index, yielding the correct number of clusters k. The proposed approach is dynamic in the sense that the user need not run the algorithm for different values of k; instead the algorithm stops itself at the best value of k, giving compact and separated clusters. The proposed algorithm takes less execution time when compared with the original k-Means and the modified approaches to k-Means clustering.
The paper is organised as follows: Section 2 presents related work. The silhouette validity index is discussed in Section 3. Section 4 describes the original k-Means. Sections 5 and 6 detail the approaches discussed in [1] and [2] respectively. Section 7 describes the proposed algorithm. Section 8 shows the implementation results. Conclusions and future work are presented in Section 9.
II. RELATED WORK
In [1] an improved k-Means algorithm is presented that addresses the sensitivity to the initial centers. This algorithm partitions the whole data space into different segments and calculates the frequency of points in each segment. The segments that show the maximum frequency are considered for the initial centroids, depending upon the value of k.
In [2] another method of finding initial cluster centers is discussed. It first finds the closest pair of data points and then, on the basis of these points, forms a subset of the dataset; this process is repeated k times to find k small subsets, from which the initial k centroids are obtained.
The author in [3] uses Principal Component Analysis for dimension reduction and to find the initial cluster centers.
In [4] the data set is first pre-processed by transforming all data values into positive space; the data is then sorted and divided into k equal sets, and the middle value of each set is taken as an initial center point.
In [5] a dynamic solution to k-Means is proposed: the algorithm is designed with a pre-processor using the silhouette validity index that automatically determines the appropriate number of clusters, which increases the efficiency of clustering to a great extent.
In [6] a method is proposed to make the algorithm independent of the number of iterations; it avoids computing the distance of each data point to the cluster centers repeatedly, saving running time and reducing computational complexity.
In [7] a dynamic means algorithm is proposed to improve cluster quality and optimize the number of clusters. The user has the flexibility either to fix the number of clusters or to input the minimum number of clusters required. In the former case it works the same as the k-Means algorithm; in the latter case the algorithm computes new cluster centers by incrementing the cluster count by one in every iteration until it satisfies the validity of cluster quality.
The main purpose in [8] is to optimize the initial centroids for the k-Means algorithm. The author proposed a hierarchical k-Means algorithm that utilizes all the clustering results of k-Means over a certain number of runs, even though some of them reach local optima. It then transforms all the centroids of the clustering results by combining them with a hierarchical algorithm in order to determine the initial centroids for k-Means. This algorithm is better suited for complex clustering cases with large datasets and many dimensional attributes.
III. SILHOUETTE VALIDITY INDEX
The Silhouette value for each point is a measure of how similar that point is to the
points in its own cluster compared to the points in other clusters. This technique computes the
silhouette width for each data point, silhouette width for each cluster and overall average
silhouette width.
The silhouette width for the i-th point of the m-th cluster is given by equation 1:

$$ S_i(m) = \frac{b_i - a_i}{\max(b_i, a_i)} \qquad (1) $$
where $a_i$ is the average distance from the i-th point to the other points in its cluster, and $b_i$ is the minimum of the average distances from point i to the points in each of the other k-1 clusters. It ranges from -1 to +1. A silhouette index close to 1 for a point i indicates that it belongs to the cluster to which it was assigned. A value of zero indicates that the point could equally be assigned to the closest other cluster. A value close to -1 indicates that the point is wrongly clustered or lies somewhere between clusters.
The silhouette width for each cluster is given by equation 2:

$$ S(m) = \frac{1}{n(m)} \sum_{i=1}^{n(m)} S_i(m) \qquad (2) $$
The overall average silhouette width is given by equation 3:

$$ S = \frac{1}{k} \sum_{m=1}^{k} S(m) \qquad (3) $$
We have used this silhouette validity index as a measure of cluster validity in the implementation of the original k-Means, modified approach I and modified approach II. We have also used this measure as the basis for making the proposed algorithm work dynamically.
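Equations 1-3 can be sketched in Python as follows (an illustrative sketch only; the paper's implementation is in MATLAB, and the function names here are ours; `labels` is assumed to be a list of cluster ids, one per point):

```python
from math import dist  # Euclidean distance between two points (Python >= 3.8)

def silhouette_widths(X, labels):
    """Per-point silhouette width S_i (equation 1)."""
    clusters = sorted(set(labels))
    S = []
    for i, x in enumerate(X):
        own = labels[i]
        # a_i: average distance from x to the other points of its own cluster
        same = [dist(x, y) for j, y in enumerate(X) if labels[j] == own and j != i]
        a = sum(same) / len(same)
        # b_i: minimum over the other clusters of the average distance to that cluster
        b = min(
            sum(dist(x, y) for j, y in enumerate(X) if labels[j] == c) / labels.count(c)
            for c in clusters if c != own
        )
        S.append((b - a) / max(a, b))
    return S

def overall_silhouette(X, labels):
    """Average of the per-cluster average widths (equations 2 and 3)."""
    S = silhouette_widths(X, labels)
    clusters = sorted(set(labels))
    per_cluster = [sum(s for s, l in zip(S, labels) if l == c) / labels.count(c)
                   for c in clusters]
    return sum(per_cluster) / len(per_cluster)
```

An `overall_silhouette` value close to 1 indicates compact, well-separated clusters, matching the interpretation above.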
IV. ORIGINAL K-MEANS ALGORITHM
The k-Means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.
The k-means algorithm proceeds as follows:
1. Randomly select k of the objects, each of which initially represents a cluster mean or center.
2. For each of the remaining objects, assign the object to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. Then compute the new mean of each cluster using equation 4:

$$ M_j = \frac{1}{n_j} \sum_{\forall Z_p \in C_j} Z_p \qquad (4) $$

where $M_j$ is the centroid of cluster j and $n_j$ is the number of data points in cluster j.
3. This process iterates until the criterion function converges.
Typically the square-error criterion is used, defined by equation 5:

$$ E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \qquad (5) $$

where p is a data point and $m_i$ is the center of cluster $C_i$; E is the sum of squared errors over all points in the dataset. The distance in the criterion function is the Euclidean distance, which is used to calculate the distance between a data point and a cluster center.
The Euclidean distance between two vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) can be calculated using equation 6:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (6) $$
Algorithm: The k –Means algorithm for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster.
Input:
• k: the number of clusters,
• D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is most similar, based
on the mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;
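The method above can be sketched in Python as follows (an illustrative sketch only, not the paper's MATLAB implementation; function and variable names are ours):

```python
import random
from math import dist

def k_means(D, k, max_iter=100, seed=0):
    """Basic k-Means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(D, k)                       # step 1: arbitrary initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 3: (re)assign each object to the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for p in D:
            j = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[j].append(p)
        # step 4: recompute each center as the mean of its cluster (equation 4)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:                   # step 5: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters
```

Because the initial centers are sampled at random, different seeds can converge to different local optima, which is exactly the sensitivity the paper sets out to remove.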
V. MODIFIED APPROACH I
The first approach discussed in [1] optimizes the Original k –Means algorithm by
proposing a method on how to choose initial clusters. The author proposed a method that
partitions the given input data space into k * k segments, where k is desired number of
clusters. After portioning the data space, frequency of each segment is calculated and highest
k frequency segments are chosen to represent initial clusters. If some parts are having same
frequency, the adjacent segments with the same least frequency are merged until we get the k
number of segments. Then initial centers are calculated by taking the mean of the data points
in respective segments to get the initial k centers. By this process we will get the initial which
are always same as compared to the Original k – Means algorithm which always selects
initial random centers for a given dataset.
Next, a threshold distance is calculated for each centroid: the distance between each pair of cluster centroids is computed, and for each centroid half of the minimum distance to the remaining centroids is taken. The threshold distance is denoted dc(i) for the cluster C i.
To assign a data point to a cluster, take a point p in the dataset, calculate its distance from the centroid of cluster i, and compare it with dc(i). If it is less than or equal to dc(i), assign the data point p to cluster i; otherwise calculate its distance from the other centroids. This process is repeated until the data point p is assigned to one of the clusters. If the data point p is not assigned to any cluster, then the centroid that shows the minimum distance to the data point p becomes the centroid for that point. Each centroid is then updated by calculating the mean of the data points in its cluster.
Pseudo code for Modified k-Means algorithm is as follows:
Input: Dataset of N data points D (i = 1 to N)
Desired number of clusters = k
Output: N data points clustered into k clusters.
Steps:
1. Input the data set and value of k.
2. If the value of k is 1 then Exit.
3. Else
4. /* divide the data point space into k*k segments: k vertically and k horizontally */
5. For each dimension {
6. Calculate the minimum and maximum value of the data points.
7. Calculate the range of each group (RG) using equation 7:

$$ RG = \frac{(min + max)}{k} \qquad (7) $$
8. Divide the data point space into k groups with width RG
9. }
10. Calculate the frequency of data points in each partitioned space.
11. Choose the k highest-frequency groups.
12. Calculate the mean of each selected group. /* These will be the initial centroids of the k clusters. */
13. Calculate the distance between each pair of cluster centroids using equation 8:

$$ d(C_i, C_j) = \{ d(m_i, m_j) : (i, j) \in [1, k],\ i \neq j \} \qquad (8) $$

where $d(C_i, C_j)$ is the distance between centroids i and j.
14. Take the minimum distance for each cluster and halve it using equation 9:

$$ dc(i) = \frac{1}{2} \min_j \{ d(C_i, C_j) \} \qquad (9) $$

where dc(i) is half of the minimum distance of the i-th cluster from the remaining clusters.
15. For each data point Zp = 1 to N {
16. For each cluster j= 1 to k {
17. Calculate d(Zp, Mj) using equation 10:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (10) $$

where d(x, y) is the distance between vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).
18. If d(Zp, Mj) ≤ dc(j) {
19. Then assign Zp to cluster Cj.
20. Break;
21. }
22. Else
23. Continue;
24. }
25. If Zp does not belong to any cluster, then
26. assign Zp to the cluster C_i with minimum d(Zp, M_i), where i ∈ [1, k]
27. }
28. Check the termination condition of the algorithm; if satisfied
29. Exit.
30. Else
31. Calculate the centroid of each cluster using equation 11:

$$ M_j = \frac{1}{n_j} \sum_{\forall Z_p \in C_j} Z_p \qquad (11) $$

where $M_j$ is the centroid of cluster j and $n_j$ is the number of data points in cluster j.
32. Go to step 13.
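The initial-center selection of steps 4-12 can be sketched in Python as follows. This is an illustrative sketch only: the merging of equal-frequency segments is omitted, and the cell width uses the conventional (max - min)/k rather than the (min + max)/k printed in equation 7; all names are ours.

```python
from collections import Counter

def initial_centers(points, k):
    """Grid-based initial centers: split each dimension into k bands,
    pick the k most populated cells, and return each cell's mean point.
    Simplified sketch; cell width is taken as (max - min) / k."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    width = [(max(p[d] for p in points) - lo[d]) / k or 1.0 for d in range(dims)]

    def cell(p):
        # index of the grid cell a point falls in (clamped to the last band)
        return tuple(min(int((p[d] - lo[d]) / width[d]), k - 1) for d in range(dims))

    freq = Counter(cell(p) for p in points)
    centers = []
    for c, _ in freq.most_common(k):                 # k highest-frequency cells
        members = [p for p in points if cell(p) == c]
        centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return centers
```

Because the grid and the frequencies are deterministic, the same dataset always yields the same initial centers, unlike the random selection of the original k-Means.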
VI. MODIFIED APPROACH II
In the work of [2], the author calculates the distance between each pair of data points and selects the pair with the minimum distance, removing it from the actual dataset. Then a data point is taken from the dataset, its distance to the selected initial points is calculated, and the point with the minimum distance is added to the initial set. This process is repeated until a threshold value is reached. If the number of subsets formed is less than k, the distances between the remaining data points are calculated again and the process is repeated until k subsets are formed.
The first phase is to determine the initial centroids. For this, compute the distance between each data point and all other data points in the set D. Then find the closest pair of data points and form a set A1 consisting of these two data points, deleting them from the data point set D. Then determine the data point closest to the set A1, add it to A1 and delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. Then form another data-point set A2 in the same way. Repeat this until 'k' such sets of data points are obtained. Finally, the initial centroids are obtained by averaging all the vectors in each data-point set. The Euclidean distance is used for determining the closeness of each data point to the cluster centroids.
The next phase is to assign points to the clusters. The main idea here is to keep two simple data structures that retain, for every data object, the label of its cluster and its distance to the nearest cluster center during each iteration, so that they can be used in the next iteration. We calculate the distance between the current data object and the new cluster center; if the computed distance is smaller than or equal to the distance to the old center, the data object stays in the cluster it was assigned to in the previous iteration. Therefore there is no need to calculate the distances from this data object to the other k-1 cluster centers, saving that calculation time. Otherwise, we must calculate the distance from the current data object to all k cluster centers, find the nearest cluster center, and assign the point to it; we then record the label of the nearest cluster center and the distance to it. Because in each iteration some data points remain in their original clusters, part of the distance computation is avoided, saving total calculation time and thereby enhancing the efficiency of the algorithm.
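The caching idea of this second phase can be sketched in Python as follows (an illustrative sketch of the idea only; `centers` would come from the first phase, and the names are ours, not the paper's):

```python
from math import dist

def assign_with_cache(points, centers, max_iter=100):
    """k-Means assignment that caches each point's nearest-center distance.
    If a point is still within its cached distance of its (moved) center,
    the scan over the other k-1 centers is skipped."""
    k = len(centers)
    # initial full assignment: ClusterId and Nearest_Dist of the pseudocode
    cluster_id = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
    nearest = [dist(p, centers[cluster_id[i]]) for i, p in enumerate(points)]
    for _ in range(max_iter):
        # recalculate the centroids
        new_centers = []
        for j in range(k):
            members = [p for p, c in zip(points, cluster_id) if c == j]
            new_centers.append(
                tuple(sum(v) / len(members) for v in zip(*members)) if members else centers[j]
            )
        if new_centers == centers:              # convergence: no center updates
            break
        centers = new_centers
        for i, p in enumerate(points):
            d_own = dist(p, centers[cluster_id[i]])
            if d_own <= nearest[i]:
                nearest[i] = d_own              # stays in its cluster, no full scan
            else:
                j = min(range(k), key=lambda j: dist(p, centers[j]))
                cluster_id[i], nearest[i] = j, dist(p, centers[j])
    return cluster_id, centers
```

The `d_own <= nearest[i]` test is what saves the distance computations to the other k-1 centers for points that do not move.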
Pseudo code for the modified k-Means algorithm is as follows:
Input: Dataset D of N data points (i = 1 to N)
Desired number of clusters = k
Output: N data points clustered into k clusters.
Phase 1:
Steps:
1. Set m = 1;
2. Compute the distance between each data point and all other data points in the set D using equation 12:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (12) $$

where d(x, y) is the distance between vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).
3. Find the closest pair of data points from the set D and form a data-point set Am (1 <= m <= k) which contains these two data points; delete these two data points from the set D;
4. Find the data point in D that is closest to the data-point set Am, add it to Am and delete it from D;
5. Repeat step 4 until the number of data points in Am reaches 0.75*(N/k);
6. If m < k, then m = m + 1; find another pair of data points from D between which the distance is the shortest, form another data-point set Am and delete them from D; go to step 4;
7. For each data-point set Am (1 <= m <= k) find the arithmetic mean of the vectors of data points in Am; these means will be the initial centroids.
Phase 2:
Steps:
1. Compute the distance of each data point di (1 <= i <= N) to all the centroids Cj (1 <= j <= k) as d(di, Cj) using equation 12;
2. For each data point di, find the closest centroid Cj and assign di to cluster j.
3. Set ClusterId[i] = j; /* j: Id of the closest cluster for point i */
4. Set Nearest_Dist[i] = d(di, Cj);
5. For each cluster j (1 <= j <= k), recalculate the centroids;
6. Repeat
7. For each data point di,
a. Compute its distance from the centroid of the present nearest cluster;
b. If this distance is less than or equal to the present nearest distance, the data point stays in the cluster;
c. Else for every centroid cj (1 <= j <= k) compute the distance d(di, Cj);
d. End for;
8. Assign the data point di to the cluster with the nearest centroid Cj;
9. Set ClusterId[i] = j;
10. Set Nearest_Dist[i] = d(di, Cj);
11. End for (step (7));
12. For each cluster j (1 <= j <= k), recalculate the centroids until the convergence criteria is met, i.e. either no center updates or no point moves to another cluster.
VII. PROPOSED APPROACH
The changes are based on the selection of initial k centers and making the algorithm to
work dynamically, i.e. instead of running algorithms for different values of k, we try to make
algorithm in such a way that it itself decides how many clusters are there in a given dataset.
The two modifications are as follows:
• A method to select initial k centers.
• To make algorithm dynamic.
Proposed Algorithm:
The algorithm consists of three phases:
The first phase determines the initial k centroids. In this phase the user inputs the dataset and the value of k. The data space is divided into k*k segments as discussed in [1]. After dividing the data space we choose the k segments with the highest frequency of points. If some segments have the same frequency, adjacent segments with the same least frequency are merged until we get k segments.
We then find the distance of each point in each selected segment from the origin; these distances are sorted for each segment, and the point with the middle distance is selected as the center for that segment. This step is repeated for each of the k selected segments. These points represent the initial k centroids.
The second phase assigns points to clusters based on the minimum distance between each point and the cluster centers. The distance measure used is the Euclidean distance. The mean of each cluster formed is then computed as the next center. This process is repeated until there are no more center updates.
The third phase is where the algorithm iterates dynamically to determine the right number of clusters, using the concept of the silhouette validity index to choose it.
Pseudo Code for Proposed Algorithm:
Input: Dataset of N points.
Desired number of k clusters.
Output: N points grouped into k clusters.
Phase1: Finding Initial centroids
Steps:
1. Input the dataset and value of k ≥ 2.
2. Divide the data point set into k*k segments /*k vertically and k horizontally*/
3. For each dimension
{
4. Calculate the minimum and maximum value of data points.
5. Calculate the width (Rg) using equation 13:

$$ Rg = \frac{min + max}{k} \qquad (13) $$
}
6. Calculate the frequency of data points in each segment.
7. Choose the k highest frequency segments.
8. For each segment i = 1 to k
{
9. For each point j in the segment i
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-
6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME
213
{
10. Calculate the distance of point j from the origin
}
11. Sort these distances in ascending order in matrix D
12. Select the middle distance value.
13. The co-ordinates corresponding to the distance selected in step 12 are chosen as the initial center for the i-th cluster.
}
14. These k co-ordinates are stored in matrix C which represents the initial
centroids.
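Phase 1's mid-distance rule can be sketched in Python as follows. This is an illustrative sketch only: the merging of equal-frequency segments is omitted, the cell width uses the conventional (max - min)/k rather than the (min + max)/k printed in equation 13, and all names are ours.

```python
from collections import Counter
from math import dist

def phase1_centers(points, k):
    """In each of the k most populated grid cells, pick the point whose
    distance from the origin is the middle value for that cell."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    width = [(max(p[d] for p in points) - lo[d]) / k or 1.0 for d in range(dims)]

    def cell(p):
        # index of the grid cell a point falls in (clamped to the last band)
        return tuple(min(int((p[d] - lo[d]) / width[d]), k - 1) for d in range(dims))

    freq = Counter(cell(p) for p in points)
    origin = (0.0,) * dims
    centers = []
    for c, _ in freq.most_common(k):                 # k highest-frequency cells
        members = sorted((p for p in points if cell(p) == c),
                         key=lambda p: dist(p, origin))
        centers.append(members[len(members) // 2])   # point at the middle distance
    return centers
```

Unlike the mean-based rule of modified approach I, the returned centers are actual data points, chosen by the median of the origin distances within each cell.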
Phase2: Assigning points to the cluster
Steps:
1. Repeat
2. For each data point p = 1 to N
{
3. For each cluster j = 1 to k
{
4. Calculate the distance between point p and the cluster centroid cj of Cj using equation 14:

$$ d(p, c_j) = \sqrt{(p - c_j)^2} \qquad (14) $$
}
}
5. Assign p to min{d(p, cj)} where j ∈ [1, k].
6. Check the termination condition of the algorithm; if satisfied
7. Exit
8. Else
9. Calculate the new centroids of the clusters using equation 15:

$$ c_j = \frac{1}{n_j} \sum_{\forall p \in C_j} p \qquad (15) $$

where $n_j$ is the number of points in cluster $C_j$.
10. Go to step 1.
Phase3: To determine appropriate number of clusters
For the given value of k, phases 1 and 2 are run for three iterations using k-2, k and k+2. The three corresponding silhouette values are calculated as discussed in Section 3. These are denoted by Sk-2, Sk, Sk+2. The appropriate number of clusters is then found using the following steps.
Steps:
1. If Sk-2 < Sk and Sk > Sk+2, then run phase 1 and phase 2 using k+1 and k-1 and find the corresponding Sk+1 and Sk-1. The maximum of the three values Sk-1, Sk, Sk+1 then determines the appropriate number of clusters. For example, if Sk+1 is the maximum, then the number of clusters formed by the algorithm is k+1.
2. Else if Sk+2 > Sk and Sk+2 > Sk-2, then run phase 1 and phase 2 using k+1, k+3 and k+4 and find the corresponding Sk+1, Sk+3 and Sk+4. The value of k corresponding to the maximum of Sk+1, Sk+2, Sk+3, Sk+4 is returned.
3. Else if Sk+2 < Sk-2 and Sk < Sk-2, then run phase 1 and phase 2 using k-1, k-2, k-3 and k-4 and find the corresponding Sk-1, Sk-2, Sk-3 and Sk-4. The value of k corresponding to the maximum of Sk-1, Sk-2, Sk-3, Sk-4 is returned.
4. Stop.
Thus the algorithm terminates itself when the best value of k is found. This value of k shows the appropriate number of clusters for the given dataset.
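The branching of Phase 3 can be sketched in Python as follows, assuming a helper `cluster_and_score(data, k)` that runs phases 1 and 2 and returns the overall silhouette value. The helper is hypothetical; only the branching mirrors the steps above.

```python
def best_k(data, k0, cluster_and_score):
    """Dynamic search of Phase 3: probe k0-2, k0, k0+2, then refine
    around whichever silhouette value is largest."""
    s = {k: cluster_and_score(data, k) for k in (k0 - 2, k0, k0 + 2)}
    if s[k0] > s[k0 - 2] and s[k0] > s[k0 + 2]:          # step 1: peak near k0
        candidates = [k0 - 1, k0, k0 + 1]
    elif s[k0 + 2] > s[k0] and s[k0 + 2] > s[k0 - 2]:    # step 2: peak to the right
        candidates = [k0 + 1, k0 + 2, k0 + 3, k0 + 4]
    else:                                                # step 3: peak to the left
        candidates = [k0 - 4, k0 - 3, k0 - 2, k0 - 1]
    for k in candidates:
        s.setdefault(k, cluster_and_score(data, k))      # reuse values already probed
    return max(candidates, key=lambda k: s[k])           # step 4: stop at the best k
```

Only a handful of k values are ever clustered, which is why the dynamic run is cheaper than sweeping all values of k by hand.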
VIII. RESULT ANALYSIS
The proposed algorithm is implemented and the results are compared with those of the modified approaches [1] and [2] in terms of execution time and the initial centers chosen.
1. The total time taken by the algorithm to form clusters and dynamically determine the appropriate number of clusters is less than the total time taken by the algorithm of [1] when run for different values of k. For example, if we run the algorithm of [1] for k = 2, 3, 4, 5, 6, 7, etc., it takes more time than the proposed algorithm, which itself runs over different values of k.
2. We define a new method of determining the initial centers that is based on the middle value rather than the mean value. The reason is that the middle value better represents the distribution: whereas the mean is influenced by very large and very small values, the middle value is not affected by them.
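This robustness claim can be seen with a small example (arbitrary numbers, not taken from the paper's datasets):

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2]   # middle value (upper middle for even lengths)

values = [10, 11, 12, 13, 14]
assert mean(values) == 12 and median(values) == 12
values_with_outlier = [10, 11, 12, 13, 1000]   # one extreme value
assert mean(values_with_outlier) == 209.2      # the mean is dragged away
assert median(values_with_outlier) == 12       # the middle value is unaffected
```

A single extreme point moves the mean far from the bulk of the data, while the middle value stays put, which is why it is preferred here for center selection.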
The results show that the algorithm works dynamically and is also an improvement over the original k-Means. Table I shows the results of running the algorithm of [1] over the wine dataset from the UCI repository for k = 3, 4, 5, 6, 7 and 9. The algorithm is run for these values of k because the proposed algorithm was initially fed k = 7 and runs over these values of k automatically, so the total execution times of both algorithms can be compared. The results show that the proposed algorithm takes less time than running the algorithm of [1] individually for different values of k.
TABLE I: RESULTS OF ALGORITHM [1] FOR WINE DATASET
Sr. no. | Value of k | Silhouette validity index | Execution time (s)
1. | 3 | 0.4437 | 3.92
2. | 4 | 0.3691 | 4.98
3. | 5 | 0.3223 | 2.89
4. | 6 | 0.2923 | 7.51
5. | 7 | 0.2712 | 3.56
6. | 9 | 0.2082 | 11.96
Total execution time: 34.82
The run of the proposed algorithm stops at:
maximum silhouette value = 0.443641 for the best value of k = 3
Elapsed time is 30.551549 seconds.
Thus, from the results it is clear that the algorithm stops itself when it finds the right number of clusters, and that the proposed algorithm takes less time than the algorithm of [1]. Table II shows the execution times of both algorithms. It also shows the initial value fed to our algorithm and the best value of k at which the algorithm stops. Experiments are performed on random datasets of 50, 178 and 500 points.
TABLE II: COMPARING RESULTS OF PROPOSED ALGORITHM & ALGORITHM IN [1]
Sr. no. | Dataset | Initial value of k | Best value of k | Proposed algorithm time (s) | Algorithm [1] time (s)
1. | 50 points | 4 | 6 | 18.6084 | 28.0096
2. | 178 points | 9 | 10 | 39.1941 | 50.2726
3. | 500 points | 5 | 4 | 66.6134 | 91.9400
Table III shows comparison results between the original k-Means, modified approach II and the proposed algorithm. Comparing execution times, the proposed algorithm takes much less time than the original k-Means on the larger datasets, and for the large dataset of 500 points the proposed algorithm also outperforms modified approach II.
TABLE III: EXECUTION TIME (s) COMPARISON
Sr. no. | Dataset | Original k-Means | Modified approach II | Proposed algorithm
1. | 50 points | 15.1727 | 11.4979 | 18.6084
2. | 178 points | 74.5168 | 21.6497 | 39.1941
3. | 500 points | 86.7619 | 87.2461 | 66.6134
From all the results we can conclude that although the procedure of the proposed algorithm is
longer, it spares the user from running the algorithm for different values of k as in the other
three algorithms discussed in the previous sections. The proposed algorithm dynamically
iterates and stops at the best value of k.
Figures 1–3 show the silhouette plots for the three random-point datasets discussed above,
depicting how close each point is to the other members of its own cluster. The plots also show
that a point is placed in an incorrect cluster if the silhouette index value for that point is
negative. Figure 4 shows an execution time comparison of all the algorithms discussed in the
paper.
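The property just mentioned — a negative silhouette value flags a point that sits closer to another cluster than to its own — can be checked with a small pure-Python sketch. The helper name and the toy data below are my own illustrative assumptions, not taken from the paper:

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette s(i) = (b_i - a_i) / max(a_i, b_i), where
    a_i is the mean distance to the point's own cluster and b_i the
    smallest mean distance to any other cluster."""
    clusters = set(labels)
    values = []
    for i, p in enumerate(points):
        own = labels[i]
        same = [q for q, l in zip(points, labels) if l == own]
        # a_i: mean distance to the *other* members of the same cluster
        # (the sum includes dist(p, p) = 0, hence division by len - 1)
        a = (sum(math.dist(p, q) for q in same) / (len(same) - 1)
             if len(same) > 1 else 0.0)
        # b_i: smallest mean distance to any other cluster
        b = min(sum(math.dist(p, q) for q, l in zip(points, labels) if l == c)
                / labels.count(c)
                for c in clusters if c != own)
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

# Two tight blobs; the last point (1, 1) belongs with the first blob
# but is deliberately mislabeled into the second cluster.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (1, 1)]
lab = [0, 0, 0, 1, 1, 1]
s = silhouette_values(pts, lab)
```

Here the mislabeled point receives a negative silhouette value, while every correctly placed point receives a positive one.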
Figure 1 Silhouette plot for 50 points
Figure 2 Silhouette plot for 178 points
Figure 3 Silhouette plot for 500 points
Figure 4 Execution time (s) comparison of different algorithms
Table IV compares all four algorithms discussed in the paper on the basis of the
experimental results.
TABLE IV: COMPARISON OF ALGORITHMS
Parameter | Original k-Means | Modified Approach I | Modified Approach II | Proposed Algorithm
Initial centers | Always random, so different clusters for the same value of k on the given data. | Initial center selection is fixed by always choosing the centers in the highest-frequency segment. | Initial centers are chosen based on the similarity between points. | Initial centers are fixed by choosing, in the highest-frequency segment, the middle point of that segment's points calculated from the origin.
Redundant data | Can work with data redundancies. | Suitable. | Not suitable for data with redundant points. | Can work with redundant data.
Dead unit problem | Yes. | No. | No. | No.
Value of k | Fixed input parameter. | Fixed input parameter. | Fixed input parameter. | Initial value given as input; the algorithm dynamically iterates and determines the best value of k for the given data.
Execution time | Highest of the four. | Less than original k-Means, but more than the other two algorithms. | Less than original k-Means and Modified Approach I, but more than the proposed algorithm. | Least of all four algorithms.
IX. CONCLUSION
In this paper we presented different approaches to k-Means clustering, comparing them
and stressing their pros and cons. Another issue discussed in the paper is clustering validity,
which we measured using the silhouette validity index; for a given dataset this index shows
which value of k produces compact and well-separated clusters. The paper presents a new
method for selecting initial centers and a dynamic approach to k-Means clustering, so that the
user need not check the clusters for different values of k: instead, the user inputs an initial
value of k and the algorithm stops once it finds the best value of k, i.e. when it attains the
maximum silhouette value. Experiments also show that the proposed dynamic algorithm takes
much less computation time than the other three algorithms discussed in the paper.
X. ACKNOWLEDGEMENTS
I am eternally grateful to my research supervisor Dr. M.P.S Bhatia for his invigorating
support, valuable suggestions and guidance. I thank him for his supervision and for patiently
correcting my work; it has been a great experience and I have gained a lot here. I am thankful
to the almighty God who gave me the power, good sense and confidence to complete my
research analysis successfully. I also thank my parents and my friends, who were a constant
source of encouragement, and Navneet Singh for his appreciation.
REFERENCES
[1] Ran Vijay Singh and M.P.S Bhatia, "Data Clustering with Modified K-means Algorithm", Proc. IEEE International Conference on Recent Trends in Information Technology (ICRTIT 2011), pp. 717-721.
[2] D. Napoleon and P. Ganga Lakshmi, "An Efficient K-Means Clustering Algorithm for Reducing Time Complexity using Uniform Distribution Data Points", IEEE, 2010.
[3] Tajunisha and Saravanan, "Performance Analysis of k-means with Different Initialization Methods for High Dimensional Data", International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 1, No. 4, October 2010.
[4] Neha Aggarwal and Kriti Aggarwal, "A Mid-point Based k-means Clustering Algorithm for Data Mining", International Journal on Computer Science and Engineering (IJCSE), 2012.
[5] Barileé Barisi Baridam, "More Work on k-means Clustering Algorithm: The Dimensionality Problem", International Journal of Computer Applications (0975-8887), Vol. 44, No. 2, April 2012.
[6] Shi Na, Li Xumin and Guan Yong, "Research on K-means Clustering Algorithm", Proc. Third International Symposium on Intelligent Information Technology and Security Informatics, IEEE, 2010.
[7] Ahamad Shafeeq and Hareesha, "Dynamic Clustering of Data with Modified K-mean Algorithm", Proc. International Conference on Information and Computer Networks (ICICN 2012), IPCSIT Vol. 27, IACSIT Press, Singapore, 2012.
[8] Kohei Arai and Ali Ridho Barakbah, "Hierarchical K-means: An Algorithm for Centroids Initialization for k-Means", Reports of the Faculty of Science and Engineering, Saga University, Vol. 26, No. 1, 2007.
[9] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (Second Edition).
 
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
 
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
 
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
 
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTA MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
 

Dernier

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Dernier (20)

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

DBMS gave access to the data stored, but this was only a small part of what could be gained from the data.
Analyzing data can provide further knowledge about a business by going beyond what is explicitly stored to derive knowledge about the business. The value of the information that can be learned from data has made clustering techniques widely applied in artificial intelligence, customer-relationship management, data compression, data mining, image processing, machine learning, pattern recognition, market analysis, fraud detection, and so on.

Cluster analysis of data is an important task in knowledge discovery and data mining. Clustering is the process of grouping data on the basis of similarities and dissimilarities among the data elements: it finds groups of objects such that objects in one group are similar to one another and different from the objects in other groups. Clustering is an unsupervised algorithm that requires a parameter specifying the number of clusters k. Setting this parameter requires either detailed knowledge of the dataset or running the algorithm for different values of k to determine the correct number of clusters. For large and multidimensional data, however, clustering becomes time consuming and determining the correct number of clusters becomes difficult.

The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. It has also drawn criticism for its performance, particularly for demanding the value of k up front. Previous research shows that providing the number of clusters in advance does not in any way guarantee the production of good-quality clusters. The original k-Means also determines the initial centers randomly in each run, which leads to different solutions.
To validate the clustering results we have chosen the Silhouette validity index as a validity measure. The Silhouette index is particularly useful when seeking to know whether the clusters produced are compact and well separated; it is applied after clustering to check the validity of the clusters produced.

This paper presents a new method for selecting the initial k centers and a dynamic approach to k-Means clustering. An initial value of k, ko, is provided by the user. The algorithm partitions the whole space into different segments and calculates the frequency of data points in each segment. The ko highest-frequency segments are chosen as the initial ko clusters. To determine the initial centers, the algorithm calculates, for each segment, the distance of its points from the origin, sorts these distances, and takes the coordinates corresponding to the mid value of the sorted distances as the center for that segment. Cluster assignment is then performed, and the Silhouette validity index is calculated for the initial ko clusters. This step is repeated for (ko + 2) and (ko - 2) clusters. The algorithm then iterates under the specified conditions and stops at the maximum value of the silhouette index, yielding the correct number of clusters k. The proposed approach is dynamic in the sense that the user need not run the algorithm for different values of k; instead the algorithm stops itself at the best value of k, giving compact and well-separated clusters. The proposed algorithm takes less execution time than the original k-Means and the modified approaches to k-Means clustering.

The paper is organised as follows: Section 2 presents related work. Section 3 discusses the Silhouette validity index. Section 4 describes the original k-Means. Sections 5 and 6 detail the approaches discussed in [1] and [2] respectively. Section 7 describes the proposed algorithm. Section 8 shows implementation results. Conclusions and future work are presented in Section 9.
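The silhouette-driven search over k described above can be sketched as follows. This is an illustrative Python sketch, not the authors' MATLAB R2009b implementation: the inner k-Means is a plain Lloyd-style routine with random seeding rather than the paper's segment-based center selection, the candidate range is a small neighbourhood of ko rather than the paper's exact iteration scheme, and all function names are my own. The silhouette is computed as in equations 1-3, with singleton clusters scored 0 by convention.

```python
import math
import random

def dist(x, y):
    # Euclidean distance (equation 6 in the paper).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(pts):
    return tuple(sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0])))

def kmeans(points, k, iters=50):
    # Plain Lloyd-style k-Means with random seeding (NOT the paper's
    # segment-based initialization); just enough to drive the k search.
    random.seed(0)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist(p, centers[c]))].append(p)
        centers = [mean(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
    return clusters

def silhouette(clusters):
    # Overall average silhouette width (equations 1-3); singleton
    # clusters score 0 by convention, empty clusters are skipped.
    clusters = [cl for cl in clusters if cl]
    widths = []
    for m, cl in enumerate(clusters):
        if len(cl) == 1:
            widths.append(0.0)
            continue
        s = 0.0
        for p in cl:
            a = sum(dist(p, q) for q in cl if q is not p) / (len(cl) - 1)
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for j, other in enumerate(clusters) if j != m)
            s += (b - a) / max(a, b)
        widths.append(s / len(cl))
    return sum(widths) / len(clusters)

def dynamic_kmeans(points, k0):
    # Search a small neighbourhood of k0 and keep the k whose clustering
    # maximizes the overall silhouette index.
    candidates = range(max(2, k0 - 2), k0 + 3)
    return max(candidates, key=lambda k: silhouette(kmeans(points, k)))

# Two well-separated blobs; starting from k0 = 3 the search settles on k = 2.
pts = [(x, y) for x in (0.0, 0.1, 0.2) for y in (0.0, 0.1)]
pts += [(x, y) for x in (5.0, 5.1, 5.2) for y in (5.0, 5.1)]
```

The point of the sketch is only the outer loop: clustering quality is scored by the silhouette index, so the best k is found without the user re-running the algorithm by hand.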
II. RELATED WORK

In literature [1] there is an improved k-Means algorithm based on reducing the sensitivity to the initial centers. This algorithm partitions the whole data space into different segments and calculates the frequency of points in each segment. The segments showing the maximum frequency will be considered for the initial centroids, depending upon the value of k.

In literature [2] another method of finding initial cluster centers is discussed. It first finds the closest pair of data points and then, on the basis of these points, forms a subset of the dataset; this process is repeated k times to find k small subsets and thus the initial k centroids.

The author in literature [3] uses Principal Component Analysis for dimension reduction and to find initial cluster centers. In [4] the data set is first pre-processed by transforming all data values to positive space; the data is then sorted and divided into k equal sets, and the middle value of each set is taken as an initial center point.

In literature [5] a dynamic solution to k-Means is proposed in which the algorithm is designed with a pre-processor using the silhouette validity index that automatically determines the appropriate number of clusters, which increases the efficiency of clustering to a great extent. In [6] a method is proposed to make the algorithm independent of the number of iterations; it avoids computing the distance of each data point to the cluster centers repeatedly, saving running time and reducing computational complexity.

In literature [7] a dynamic means algorithm is proposed to improve cluster quality and optimize the number of clusters. The user has the flexibility either to fix the number of clusters or to input the minimum number of clusters required. In the former case it works the same as the k-Means algorithm.
In the latter case the algorithm computes new cluster centers by incrementing the cluster count by one in every iteration until it satisfies the validity of cluster quality.

In [8] the main purpose is to optimize the initial centroids for the k-Means algorithm. The author proposed a hierarchical k-Means algorithm. It utilizes all the clustering results of k-Means over a number of runs, even though some of them reach local optima, and then transforms all the centroids of the clustering results by combining them with a hierarchical algorithm in order to determine the initial centroids for k-Means. This algorithm is better suited to complex clustering cases with large data sets and many dimensional attributes.

III. SILHOUETTE VALIDITY INDEX

The silhouette value for each point is a measure of how similar that point is to the points in its own cluster compared to the points in other clusters. This technique computes the silhouette width for each data point, the silhouette width for each cluster, and the overall average silhouette width. The silhouette width for the i-th point of the m-th cluster is given by equation 1:

S_i(m) = (b_i - a_i) / max(b_i, a_i)    (1)

where a_i is the average distance from the i-th point to the other points in its cluster and b_i is the minimum of the average distances from point i to the points in each of the other k - 1 clusters. It
ranges from -1 to +1. A point i with a silhouette index close to 1 belongs well to the cluster it has been assigned to. A value of zero indicates the object could equally be assigned to another, closest cluster. A value close to -1 indicates that the object is wrongly clustered or lies somewhere between clusters. The silhouette width for each cluster is given by equation 2:

S(m) = (1 / n(m)) * sum_{i=1..n(m)} S_i(m)    (2)

The overall average silhouette width is given by equation 3:

S = (1 / k) * sum_{m=1..k} S(m)    (3)

We have used this silhouette validity index as a measure of cluster validity in the implementations of the original k-Means, modified approach I, and modified approach II, and as the basis for making the proposed algorithm work dynamically.

IV. ORIGINAL k-MEANS ALGORITHM

The k-Means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity. The k-Means algorithm proceeds as follows:

1. Randomly select k of the objects, each of which initially represents a cluster mean or center.
2. For each of the remaining objects, assign the object to the cluster to which it is most similar, based on the distance between the object and the cluster mean.
3. Compute the new mean for each cluster using equation 4:

M_j = (1 / n_j) * sum_{Z_p in C_j} Z_p    (4)

where M_j is the centroid of cluster j and n_j is the number of data points in cluster j.
4. This process iterates until the criterion function converges.
Typically the square-error criterion is used, defined by equation 5:

E = sum_{i=1..k} sum_{p in C_i} |p - m_i|^2    (5)

where p is a data point and m_i is the center of cluster C_i; E is the sum of squared error over all points in the dataset. The distance in the criterion function is the Euclidean distance, which is used to calculate the distance between a data point and a cluster center.
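The iteration just described, i.e. assign to the nearest mean (equation 6), recompute the means (equation 4), and stop when nothing changes, can be sketched in Python as follows. This is an illustrative sketch of the standard algorithm, not the paper's MATLAB code; function names and the toy data are my own.

```python
import math
import random

def euclid(x, y):
    # Equation 6: Euclidean distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, iters=100, seed=1):
    random.seed(seed)
    centers = random.sample(points, k)          # step 1: random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 2: assign to nearest mean
            clusters[min(range(k), key=lambda j: euclid(p, centers[j]))].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]  # equation 4: recompute means
        if new == centers:                      # stop when no center moves
            break
        centers = new
    return centers, clusters

def sse(centers, clusters):
    # Equation 5: sum of squared error over all points.
    return sum(euclid(p, centers[j]) ** 2
               for j, cl in enumerate(clusters) for p in cl)

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, 2)
```

Because the seeding is random, different seeds can converge to different partitions and different E values; this is exactly the sensitivity to initialization that the modified approaches below try to remove.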
The Euclidean distance between two vectors x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) is calculated using equation 6:

d(x, y) = sqrt( sum_{i=1..n} (x_i - y_i)^2 )    (6)

Algorithm: the k-Means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.

Input:
• k: the number of clusters,
• D: a data set containing n objects.

Output: a set of k clusters.

Method:
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;

V. MODIFIED APPROACH I

The first approach, discussed in [1], optimizes the original k-Means algorithm by proposing a method for choosing the initial clusters. The author proposed a method that partitions the given input data space into k * k segments, where k is the desired number of clusters. After partitioning the data space, the frequency of each segment is calculated and the k highest-frequency segments are chosen to represent the initial clusters. If some segments have the same frequency, adjacent segments with the same least frequency are merged until k segments are obtained. The initial centers are then calculated by taking the mean of the data points in the respective segments. By this process the initial centers are always the same for a given dataset, in contrast to the original k-Means algorithm, which always selects random initial centers.
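The segment-based seeding just described can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the exact cell-indexing rule is an assumption (the paper only gives the group width RG of equation 7), and the merging of equal-frequency segments is omitted for brevity.

```python
from collections import Counter

def grid_initial_centers(points, k):
    # Partition each dimension into k groups of width RG = (min + max) / k
    # (equation 7), count points per grid cell, and take the mean of the
    # k most populated cells as the initial centers.
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    rg = [(lo[d] + max(p[d] for p in points)) / k for d in range(dims)]

    def cell(p):
        # Index of the segment a point falls into, per dimension (assumed rule).
        return tuple(min(int((p[d] - lo[d]) / rg[d]), k - 1) for d in range(dims))

    freq = Counter(cell(p) for p in points)
    centers = []
    for c, _ in freq.most_common(k):            # k highest-frequency segments
        members = [p for p in points if cell(p) == c]
        centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return centers

pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2), (4.0, 4.0), (4.1, 4.2), (4.2, 4.0)]
```

Unlike random seeding, this procedure is deterministic: the same dataset always yields the same initial centers.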
Next, a threshold distance is calculated for each centroid, defined as half of the minimum distance from that centroid to the remaining centroids; it is denoted dc(i) for cluster C_i. To assign a data point to a cluster, take a point p in the dataset, calculate its distance from the centroid of cluster i, and compare it with dc(i). If it is less than or equal to dc(i), assign the data point p to cluster i; otherwise calculate its distance from the other centroids. This process is repeated until data point p is assigned to one of the clusters. If data point p is not assigned to any cluster, the centroid showing the minimum distance to p becomes the centroid for that point. Each centroid is then updated by calculating the mean of the data points in the cluster.
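The threshold-based assignment rule just described can be sketched as follows; an illustrative Python sketch with my own names, assuming the centroids are already given.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def threshold_assign(points, centers):
    # dc(i): half the minimum distance from centroid i to the other
    # centroids (equation 9). A point is accepted by the first centroid
    # whose threshold it falls within; otherwise it goes to the globally
    # nearest centroid.
    k = len(centers)
    dc = [0.5 * min(euclid(centers[i], centers[j])
                    for j in range(k) if j != i) for i in range(k)]
    labels = []
    for p in points:
        label = None
        for i in range(k):
            if euclid(p, centers[i]) <= dc[i]:   # early accept, skip the rest
                label = i
                break
        if label is None:                        # fall back to nearest centroid
            label = min(range(k), key=lambda i: euclid(p, centers[i]))
        labels.append(label)
    return labels

centers = [(0.0, 0.0), (10.0, 0.0)]
pts = [(1.0, 0.0), (9.0, 0.5), (4.0, 0.0)]
```

The early-accept test is the point of the threshold: a point that falls within dc(i) of centroid i cannot be closer to any other centroid, so the remaining distance computations can be skipped.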
Pseudo code for the modified k-Means algorithm is as follows:

Input: Dataset D of N data points (i = 1 to N); desired number of clusters = k
Output: N data points clustered into k clusters.

Steps:
1. Input the data set and the value of k.
2. If the value of k is 1 then Exit.
3. Else
4. /* divide the data point space into k * k segments: k vertically and k horizontally */
5. For each dimension {
6. Calculate the minimum and maximum value of the data points.
7. Calculate the range of a group (RG) using equation 7:

RG = (min + max) / k    (7)

8. Divide the data point space into k groups of width RG.
9. }
10. Calculate the frequency of data points in each partitioned space.
11. Choose the k highest-frequency groups.
12. Calculate the mean of each selected group. /* These will be the initial centroids of the k clusters. */
13. Calculate the distance between the cluster centroids using equation 8:

d(C_i, C_j) = { d(m_i, m_j) : (i, j) in [1, k] and i != j }    (8)

where d(C_i, C_j) is the distance between centroids i and j.
14. Take the minimum distance for each cluster and halve it, using equation 9:

dc(i) = (1/2) * min[ d(C_i, C_j), ... ]    (9)

where dc(i) is half of the minimum distance of the i-th cluster from the other remaining clusters.
15. For each data point Zp = 1 to N {
16. For each cluster j = 1 to k {
17. Calculate d(Zp, Mj) using equation 10:

d(x, y) = sqrt( sum_{i=1..n} (x_i - y_i)^2 )    (10)

where d(x, y) is the distance between vectors x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n).
18. If d(Zp, Mj) <= dc(j) {
19. Then assign Zp to cluster C_j.
20. Break;
21. }
22. Else
23. Continue;
24. }
25. If Zp does not belong to any cluster then
26. assign Zp to min( d(Zp, M_i) ), where i in [1, k]
27. }
28. Check the termination condition of the algorithm; if satisfied
29. Exit.
30. Else
31. Calculate the centroid of each cluster using equation 11:

M_j = (1 / n_j) * sum_{Z_p in C_j} Z_p    (11)

where M_j is the centroid of cluster j and n_j is the number of data points in cluster j.
32. Go to step 13.

VI. MODIFIED APPROACH II

In the work of [2], the author calculates the distance between each pair of data points, selects the pair showing the minimum distance, and removes it from the actual dataset. Data points from the dataset are then added to this subset one by one, each time choosing the point with the minimum distance to it, until a threshold is reached. If the number of subsets formed is less than k, the distances among the remaining data points are computed again and the process is repeated until k subsets are formed.

In more detail, the first phase determines the initial centroids. Compute the distance between each data point and all other data points in the set D. Then find the closest pair of data points and form a set A1 consisting of these two data points, deleting them from the data point set D. Then determine the data point closest to the set A1, add it to A1 and delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. Then form another data-point set A2 in the same way, and repeat until k such sets of data points are obtained. Finally the initial centroids are obtained by averaging all the vectors in each data-point set. The Euclidean distance is used for determining the closeness of each data point to the cluster centroids.

The next phase is to assign points to the clusters.
Here the main idea is to maintain two simple data structures that retain, for each data object, the label of its cluster and its distance to the nearest cluster center during each iteration; these can be reused in the next iteration. We calculate the distance between the current data object and its new cluster center; if this distance is smaller than or equal to the distance to the old center, the data object stays in the cluster it was assigned to in the previous iteration, and there is no need to calculate its distance to the other k - 1 cluster centers, saving that computation time. Otherwise, we calculate the distance from the current data object to all k cluster centers, find the nearest cluster center, assign the point to it, and separately record the label of the nearest cluster center and the distance to it. Because in each iteration some data points remain in their original cluster, part of the distance calculation is skipped, saving total computation time and thereby enhancing the efficiency of the algorithm.
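One iteration of this cached assignment step can be sketched as follows; an illustrative Python sketch with my own names, where `labels` and `nearest_dist` are the two data structures described above.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cached_assign(points, centers, labels, nearest_dist):
    # One iteration of the cached assignment step: if a point is no farther
    # from its old center than before, it stays put and the other k - 1
    # distance computations are skipped; otherwise all centers are rescanned.
    k = len(centers)
    for i, p in enumerate(points):
        d_old = euclid(p, centers[labels[i]])
        if d_old <= nearest_dist[i]:
            nearest_dist[i] = d_old            # point stays in its cluster
            continue
        j = min(range(k), key=lambda c: euclid(p, centers[c]))
        labels[i], nearest_dist[i] = j, euclid(p, centers[j])
    return labels, nearest_dist

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centers = [(0.5, 0.0), (10.0, 0.0)]   # centers after an update step
labels = [0, 0, 1]                    # labels from the previous iteration
nearest = [0.0, 1.0, 0.0]             # cached distances to the old centers
labels, nearest = cached_assign(points, centers, labels, nearest)
```

In this example only the first point triggers a full rescan; the other two satisfy the cached-distance test and skip the remaining centers.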
Pseudo code for the modified k-Means algorithm is as follows:

Input: Dataset D of N data points (i = 1 to N); desired number of clusters = k
Output: N data points clustered into k clusters.

Phase 1:
Steps:
1. Set m = 1;
2. Compute the distance between each data point and all other data points in the set D using equation 12:

        d(x, y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 + ... + (xn − yn)^2 )        (12)

   where d(x, y) is the distance between the vectors x = (x1, x2, x3, ..., xn) and y = (y1, y2, y3, ..., yn).
3. Find the closest pair of data points in the set D and form a data-point set Am (1 <= m <= k) containing these two data points; delete these two data points from the set D;
4. Find the data point in D that is closest to the data-point set Am; add it to Am and delete it from D;
5. Repeat step 4 until the number of data points in Am reaches 0.75*(N/k);
6. If m < k, then set m = m + 1, find another pair of data points in D between which the distance is the shortest, form another data-point set Am and delete them from D; go to step 4;
7. For each data-point set Am (1 <= m <= k), find the arithmetic mean of the vectors of data points in Am; these means will be the initial centroids.

Phase 2:
Steps:
1. Compute the distance of each data point di (1 <= i <= N) to all the centroids Cj (1 <= j <= k) as d(di, Cj), using equation 12;
2. For each data point di, find the closest centroid Cj and assign di to cluster j;
3. Set ClusterId[i] = j; /* j: Id of the closest cluster for point i */
4. Set Nearest_Dist[i] = d(di, Cj);
5. For each cluster j (1 <= j <= k), recalculate the centroids;
6. Repeat
7. For each data point di:
   a. Compute its distance from the centroid of the present nearest cluster;
   b. If this distance is less than or equal to the present nearest distance, the data point stays in the cluster;
   c. Else for every centroid cj (1 <= j <= k) compute the distance d(di, Cj);
   d. End for;
8. Assign the data point di to the cluster with the nearest centroid Cj;
9. Set ClusterId[i] = j;
10. Set Nearest_Dist[i] = d(di, Cj);
11. End for (step 7);
12. For each cluster j (1 <= j <= k), recalculate the centroids;
until the convergence criteria is met, i.e. either no center updates or no point moves to another cluster.
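Phase 1 above (closest-pair seeding of k data-point sets, each grown to roughly 0.75*N/k points before averaging) might be sketched as follows. This is an illustrative reading of the pseudocode, not the authors' implementation:

```python
import numpy as np

def initial_centroids(X, k):
    """Seed each of k sets with the closest remaining pair of points, grow the
    set to ~0.75*N/k points by absorbing the nearest remaining point, then
    return each set's mean as an initial centroid (Phase 1 of [2])."""
    X = np.asarray(X, dtype=float)
    remaining = list(range(len(X)))
    threshold = max(2, int(0.75 * len(X) / k))
    centroids = []
    for _ in range(k):
        # The closest pair among the remaining points seeds the set.
        pts = X[remaining]
        d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        members = [remaining[i], remaining[j]]
        for idx in sorted((int(i), int(j)), reverse=True):
            remaining.pop(idx)
        # Grow the set with the nearest remaining point until the threshold.
        while len(members) < threshold and remaining:
            pts = X[remaining]
            dset = np.linalg.norm(pts[:, None] - X[members][None, :],
                                  axis=2).min(axis=1)
            members.append(remaining.pop(int(np.argmin(dset))))
        centroids.append(X[members].mean(axis=0))
    return np.array(centroids)
```

The Euclidean distance of equation 12 appears here as `np.linalg.norm`; the mean over each set matches step 7 of Phase 1.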
VII. PROPOSED APPROACH

The changes are based on the selection of the initial k centers and on making the algorithm work dynamically: instead of running the algorithm for different values of k, the algorithm itself decides how many clusters are present in a given dataset. The two modifications are as follows:
• A method to select the initial k centers.
• A dynamic determination of k.

Proposed Algorithm: The algorithm consists of three phases. The first phase determines the initial k centroids. The user inputs the dataset and the value of k, and the data space is divided into k*k segments as discussed in [1]. After dividing the data space, we choose the k segments with the highest frequency of points. If some segments have the same frequency, adjacent segments with the same least frequency are merged until k segments are obtained. Then, for each selected segment, the distance of each point in the segment from the origin is computed; these distances are sorted, and the middle point is selected as the center for that segment. This step is repeated for each of the k selected segments. These points represent the initial k centroids.

The second phase assigns points to clusters based on the minimum distance between each point and the cluster centers, using the Euclidean distance measure. It then computes the mean of each cluster as its next center. This process is repeated until there are no more center updates.

In the third phase, the algorithm iterates dynamically to determine the right number of clusters, using the concept of the silhouette validity index.

Pseudo code for the proposed algorithm:
Input: Dataset of N points; desired number of clusters k.
Output: N points grouped into k clusters.
Phase 1: Finding initial centroids
Steps:
1. Input the dataset and a value of k >= 2.
2. Divide the data space into k*k segments /* k vertically and k horizontally */
3. For each dimension {
4.   Calculate the minimum and maximum value of the data points.
5.   Calculate the segment width (Rg) using equation 13:

        Rg = (min + max) / k        (13)
   }
6. Calculate the frequency of data points in each segment.
7. Choose the k highest-frequency segments.
8. For each segment i = 1 to k {
9.   For each point j in segment i
   {
10.    Calculate the distance of point j from the origin
   }
11.  Sort these distances in ascending order in matrix D
12.  Select the middle distance.
13.  The co-ordinates corresponding to the distance selected in step 12 are chosen as the initial center for the ith cluster.
  }
14. These k co-ordinates are stored in matrix C, which represents the initial centroids.

Phase 2: Assigning points to clusters
Steps:
1. Repeat
2. For each data point p = 1 to N {
3.   For each cluster j = 1 to k {
4.     Calculate the distance between point p and the cluster centroid cj of Cj using equation 14:

          d(p, cj) = sqrt( Σi (pi − cji)^2 )        (14)

     }
   }
5. Assign p to the cluster with min{d(p, cj)}, where j ∈ [1, k].
6. Check the termination condition of the algorithm; if satisfied
7. Exit
8. Else
9. Calculate the new centroids of the clusters using equation 15:

        cj = (1/nj) * Σ p,  over all p ∈ Cj        (15)

   where nj is the number of points in cluster Cj.
10. Go to step 1.

Phase 3: Determining the appropriate number of clusters
For the given value of k, phases 1 and 2 are run for three iterations using k−2, k and k+2. Three corresponding silhouette values are calculated as discussed in section 2; these are denoted by Sk-2, Sk and Sk+2. The appropriate number of clusters is then found using the following steps.

Steps:
1. If Sk-2 < Sk and Sk > Sk+2, then run phases 1 and 2 using k+1 and k−1 and find the corresponding Sk+1 and Sk-1. The maximum of the three values Sk-1, Sk, Sk+1 then determines the appropriate number of clusters; for example, if Sk+1 is maximum, the number of clusters formed by the algorithm is k+1.
2. Else if Sk+2 > Sk and Sk+2 > Sk-2, then run phases 1 and 2 using k+1, k+3 and k+4 and find the corresponding Sk+1, Sk+3 and Sk+4. The k value corresponding to the maximum of Sk+1, Sk+2, Sk+3, Sk+4 is returned.
3. Else if Sk+2 < Sk-2 and Sk < Sk-2, then run phases 1 and 2 using k−1, k−2, k−3 and k−4 and find the corresponding Sk-1, Sk-2, Sk-3 and Sk-4. The k value corresponding to the maximum of Sk-1, Sk-2, Sk-3, Sk-4 is returned.
4. Stop.

Thus the algorithm terminates itself when the best value of k is found. This value of k gives the appropriate number of clusters for the given dataset.

VIII. RESULT ANALYSIS

The proposed algorithm is implemented and its results are compared with those of the modified approaches of [1] and [2] in terms of execution time and the initial centers chosen.

1. The total time taken by the proposed algorithm to form clusters and dynamically determine the appropriate number of clusters is less than the total time taken by the algorithm in [1] when the latter is run separately for different values of k (e.g. k = 2, 3, 4, 5, 6, 7), since the proposed algorithm itself iterates over the candidate values of k.
2. We define a new method to determine the initial centers, based on the middle value rather than the mean value. The reasoning is that the middle value best represents the distribution and, unlike the mean, is not influenced by extremely large or small values.

The results show that the algorithm works dynamically and is also an improvement over the original k-Means. Table I shows the results of running the algorithm in [1] over the Wine dataset from the UCI repository for k = 3, 4, 5, 6, 7 and 9.
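The three-branch search of phase 3 can be sketched as follows. Here `cluster_and_score(X, k)` is a hypothetical helper standing in for phases 1–2 followed by a silhouette computation; the branch structure mirrors the steps above:

```python
def best_k(X, k, cluster_and_score):
    """Dynamically select the number of clusters following phase 3.

    cluster_and_score(X, k) -> silhouette value (hypothetical helper that
    runs phases 1-2 for a given k and evaluates the resulting clustering).
    """
    s = {kk: cluster_and_score(X, kk) for kk in (k - 2, k, k + 2)}
    if s[k - 2] < s[k] > s[k + 2]:
        # Step 1: peak near k, so refine with k-1 and k+1.
        candidates = [k - 1, k, k + 1]
    elif s[k + 2] > s[k] and s[k + 2] > s[k - 2]:
        # Step 2: silhouette still rising, so search above k.
        candidates = [k + 1, k + 2, k + 3, k + 4]
    else:
        # Step 3: silhouette falling, so search below k.
        candidates = [k - 1, k - 2, k - 3, k - 4]
    for kk in candidates:
        if kk not in s and kk >= 2:   # k-Means needs at least 2 clusters
            s[kk] = cluster_and_score(X, kk)
    # Step 4: return the candidate k with the maximum silhouette value.
    return max((kk for kk in candidates if kk in s), key=lambda kk: s[kk])
```

The termination behaviour follows from the fact that each branch evaluates only a bounded set of candidates before taking the maximum.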
The algorithm in [1] is run for these values of k because the proposed algorithm was initially fed k = 7 and then iterated over these values of k automatically, so the total execution times of the two algorithms can be compared. The results show that the proposed algorithm takes less time than running the algorithm of [1] individually for the different values of k.

TABLE I: RESULTS OF ALGORITHM [1] FOR WINE DATASET

Sr. no.   Value of k   Silhouette validity index   Execution time (s)
1.        3            0.4437                      3.92
2.        4            0.3691                      4.98
3.        5            0.3223                      2.89
4.        6            0.2923                      7.51
5.        7            0.2712                      3.56
6.        9            0.2082                      11.96
                       TOTAL EXECUTION TIME        34.82

The proposed algorithm, iterating through its different runs, stops at:
maximum silhouette value = 0.443641 for best value of k = 3
Elapsed time is 30.551549 seconds.
Thus, from the results it is clear that the algorithm stops itself when it finds the right number of clusters, and that the proposed algorithm takes less time than the algorithm in [1]. Table II shows the execution times of both algorithms, together with the initial value of k fed to our algorithm and the best value of k at which it stops. Experiments are performed on random datasets of 50, 178 and 500 points.

TABLE II: COMPARING RESULTS OF PROPOSED ALGORITHM & ALGORITHM IN [1]

Sr. no.   Dataset      Initial value of k   Best value of k   Proposed algorithm time (s)   Algorithm [1] time (s)
1.        50 points    4                    6                 18.6084                       28.0096
2.        178 points   9                    10                39.1941                       50.2726
3.        500 points   5                    4                 66.6134                       91.9400

Table III shows comparison results between the original k-Means, modified approach II and the proposed algorithm. The proposed algorithm takes much less time than the original k-Means, and for a large dataset such as the 500-point dataset it also outperforms modified approach II.

TABLE III: EXECUTION TIME (s) COMPARISON

Sr. no.   Dataset      Original k-Means   Modified approach II   Proposed algorithm
1.        50 points    15.1727            11.4979                18.6084
2.        178 points   74.5168            21.6497                39.1941
3.        500 points   86.7619            87.2461                66.6134

From all these results we can conclude that, although the procedure of the proposed algorithm is longer, it spares the user from running the algorithm for different values of k as in the other three algorithms discussed in this paper. The proposed algorithm iterates dynamically and stops at the best value of k. Figures 1–3 show silhouette plots for the three random datasets discussed above, depicting how close each point is to the other members of its own cluster.
The plots also show that a point has been placed in an incorrect cluster when its silhouette index value is negative. Figure 4 shows the execution time comparison of all the algorithms discussed in the paper.
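The per-point silhouette values underlying these plots can be computed as in the following sketch. This is the standard silhouette formula s(i) = (b(i) − a(i)) / max(a(i), b(i)), stated here for illustration rather than taken from the paper's implementation:

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b), where a is the mean
    distance from point i to its own cluster and b the smallest mean distance
    to any other cluster. Negative s(i) suggests the point is misassigned."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():       # singleton cluster: define s(i) = 0
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in set(labels.tolist()) - {int(labels[i])})
        s[i] = (b - a) / max(a, b)
    return s
```

The overall silhouette validity index used throughout the paper is then the mean of these per-point values, e.g. `silhouette_values(X, labels).mean()`.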
Figure 1: Silhouette plot for 50 points
Figure 2: Silhouette plot for 178 points
Figure 3: Silhouette plot for 500 points
Figure 4: Execution time (s) comparison of different algorithms

Table IV compares all four algorithms discussed in the paper on the basis of the experimental results.

TABLE IV: COMPARISON OF ALGORITHMS

Initial centers:
- Original k-Means: always random, so the same value of k can yield different clusters for the same data.
- Modified approach I: fixed, by always choosing the initial centers in the highest-frequency segments.
- Modified approach II: fixed, by choosing points based on the similarity between points.
- Proposed algorithm: fixed, by choosing in each highest-frequency segment the point whose distance from the origin is the middle of that segment's sorted distances.

Redundant data:
- Original k-Means: can work with data redundancies.
- Modified approach I: suitable.
- Modified approach II: not suitable for data with redundant points.
- Proposed algorithm: can work with redundant data.

Dead unit problem:
- Original k-Means: yes. Modified approach I: no. Modified approach II: no. Proposed algorithm: no.

Value of k:
- Original k-Means, modified approach I and modified approach II: fixed input parameter.
- Proposed algorithm: an initial value is given as input; the algorithm iterates dynamically and determines the best value of k for the given data.

Execution time:
- Original k-Means: the most of the four.
- Modified approach I: less than the original k-Means, but more than the other two algorithms.
- Modified approach II: less than the original k-Means and modified approach I, but more than the proposed algorithm.
- Proposed algorithm: less than all three other algorithms.
IX. CONCLUSION

In this paper we presented different approaches to k-Means clustering, compared them, and stressed their pros and cons. Another issue discussed in the paper is clustering validity, which we measured using the silhouette validity index; for a given dataset, this index shows which value of k produces compact and well-separated clusters. The paper presents a new method for selecting initial centers and a dynamic approach to k-Means clustering, so that the user need not check the clusters for different values of k: the user inputs an initial value of k, and the algorithm stops after it finds the best value of k, i.e. when it attains the maximum silhouette value. Experiments also show that the proposed dynamic algorithm takes much less computation time than the other three algorithms discussed in the paper.

X. ACKNOWLEDGEMENTS

I am eternally grateful to my research supervisor Dr. M.P.S Bhatia for his invigorating support and valuable suggestions and guidance. I thank him for his supervision and for patiently correcting me in my work. It has been a great experience and I have gained a lot here. I am thankful to the almighty God who has given me the strength, good sense and confidence to complete my research analysis successfully. I also thank my parents and my friends, who were a constant source of encouragement. I would also like to thank Navneet Singh for his appreciation.

REFERENCES

Proceedings Papers
[1] Ran Vijay Singh and M.P.S Bhatia, "Data Clustering with Modified K-means Algorithm", IEEE International Conference on Recent Trends in Information Technology (ICRTIT 2011), pp. 717-721.
[2] D. Napoleon and P. Ganga Lakshmi, "An Efficient K-Means Clustering Algorithm for Reducing Time Complexity using Uniform Distribution Data Points", IEEE, 2010.
Journal Papers
[3] Tajunisha and Saravanan, "Performance Analysis of k-means with Different Initialization Methods for High Dimensional Data", International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 1, No. 4, October 2010.
[4] Neha Aggarwal and Kriti Aggarwal, "A Mid-point based k-mean Clustering Algorithm for Data Mining", International Journal on Computer Science and Engineering (IJCSE), 2012.
[5] Barileé Barisi Baridam, "More Work on k-means Clustering Algorithm: The Dimensionality Problem", International Journal of Computer Applications (0975-8887), Vol. 44, No. 2, April 2012.

Proceedings Papers
[6] Shi Na, Li Xumin and Guan Yong, "Research on K-means Clustering Algorithm", Proc. Third International Symposium on Intelligent Information Technology and Security Informatics, IEEE, 2010.
[7] Ahamad Shafeeq and Hareesha, "Dynamic Clustering of Data with Modified K-mean Algorithm", Proc. International Conference on Information and Computer Networks (ICICN 2012), IPCSIT Vol. 27, IACSIT Press, Singapore, 2012.

Research Reports
[8] Kohei Arai and Ali Ridho Barakbah, "Hierarchical K-means: An Algorithm for Centroids Initialization for k-Means", Reports of the Faculty of Science and Engineering, Saga University, Vol. 26, No. 1, 2007.

Books
[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (Second Edition).