International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 3, May-June 2013, pp. 204-219, © IAEME
DYNAMIC APPROACH TO k-Means CLUSTERING ALGORITHM
Deepika Khurana 1 and Dr. M.P.S. Bhatia 2
1, 2 (Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, India)
ABSTRACT
The k-Means clustering algorithm is a heuristic algorithm that partitions the dataset into k clusters by minimizing the sum of squared distances within each cluster. However, it has a number of weaknesses. First, it requires prior knowledge of the number of clusters 'k'. Second, it is sensitive to initialization, which leads to different solutions on different runs. This paper presents a new approach to k-Means clustering by providing a solution to the initial selection of cluster centroids and a dynamic approach based on the silhouette validity index. Instead of running the algorithm for different values of k, the user needs to give only an initial value of k as ko, and the algorithm itself determines the right number of clusters for a given dataset. The algorithm is implemented in MATLAB R2009b and the results are compared with the original k-Means algorithm and other modified k-Means clustering algorithms. The experimental results demonstrate that the proposed scheme improves both the initial center selection and the overall computation time.
Keywords: Clustering, Data mining, Dynamic, k-Means, Silhouette validity index.
I. INTRODUCTION
Data Mining is defined as the mining of knowledge from huge amounts of data. Using data mining we can predict the nature and behaviour of any kind of data. It was recognized that information is at the heart of business operations and that decision makers could make use of the stored data to gain valuable insight into the business. DBMS gave access to the stored data, but this was only a small part of what could be gained from the data.
Analyzing the data can provide further knowledge about the business by going beyond the data explicitly stored to derive knowledge about the business. The ability to learn valuable information from data has made clustering techniques widely applied in areas such as artificial intelligence, customer relationship management, data compression, data mining, image processing, machine learning, pattern recognition, market analysis, and fraud detection. Cluster analysis of data is an important task in Knowledge Discovery and Data Mining. Clustering is the process of grouping data on the basis of similarities and dissimilarities among the data elements: it finds groups of objects such that objects in one group are similar to one another and different from the objects in other groups.
Clustering is an unsupervised algorithm that requires a parameter specifying the number of clusters k. Setting this parameter requires either detailed knowledge of the dataset or running the algorithm for different values of k to determine the correct number of clusters. However, for large and multidimensional data the process of clustering becomes time consuming, and determining the correct number of clusters becomes difficult.
The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. However, its performance has also been criticized, particularly for demanding the value of k in advance. It is evident from previous research that having to provide the number of clusters in advance does not in any way assist in the production of good quality clusters. The original k-Means also determines the initial centers randomly in each run, which leads to different solutions.
To validate the clustering results we have chosen the silhouette validity index as a validity measure. The silhouette validity index is particularly useful when seeking the number of clusters that are compact and well separated. This index is used after clustering to check the validity of the clusters produced.
This paper presents a new method for the selection of the initial k centers and a dynamic approach to k-Means clustering. An initial value of k as ko is provided by the user. The algorithm then partitions the whole space into different segments and calculates the frequency of data points in each segment. The ko highest-frequency segments are chosen as the initial ko clusters. To determine the initial centers, the algorithm calculates, for each segment, the distance of its points from the origin, sorts these distances, and chooses the coordinates corresponding to the middle value of the distances as the center for that segment. Then the cluster assignment process is done, and the silhouette validity index is calculated for the initial ko clusters. This step is then repeated for (ko + 2) and (ko - 2) clusters. The algorithm then iterates again under specified conditions and stops at the maximum value of the silhouette index, yielding the correct number of clusters k. The proposed approach is dynamic in the sense that the user need not run the algorithm for different values of k; instead the algorithm stops itself at the best value of k, giving compact and separated clusters. The proposed algorithm takes less execution time when compared with the original k-Means and the modified approaches to k-Means clustering.
The paper is organised as follows: Section 2 presents related work. The silhouette validity index is discussed in Section 3. Section 4 describes the original k-Means. Sections 5 and 6 detail the approaches discussed in [1] and [2] respectively. Section 7 describes the proposed algorithm. Section 8 shows the implementation results. Conclusions and future work are presented in Section 9.
II. RELATED WORK
In [1] an improved k-Means algorithm is presented that addresses the sensitivity to the initial centers. This algorithm partitions the whole data space into different segments and calculates the frequency of points in each segment. The segments that show the maximum frequency are considered for the initial centroids, depending upon the value of k.
In [2] another method of finding initial cluster centers is discussed. It first finds the closest pair of data points and then, on the basis of these points, forms a subset of the dataset; this process is repeated k times to find k small subsets, from which the initial k centroids are obtained.
The author in [3] uses Principal Component Analysis for dimension reduction and to find the initial cluster centers.
In [4] the data set is first pre-processed by transforming all data values into positive space; the data is then sorted and divided into k equal sets, and the middle value of each set is taken as an initial center point.
In [5] a dynamic solution to k-Means is proposed: the algorithm is designed with a pre-processor using the silhouette validity index that automatically determines the appropriate number of clusters, which increases the efficiency of clustering to a great extent.
In [6] a method is proposed to make the algorithm independent of the number of iterations; it avoids computing the distance of each data point to the cluster centers repeatedly, saving running time and reducing computational complexity.
In [7] a dynamic means algorithm is proposed to improve cluster quality and optimize the number of clusters. The user has the flexibility either to fix the number of clusters or to input the minimum number of clusters required. In the former case it works the same as the k-Means algorithm; in the latter case the algorithm computes new cluster centers by incrementing the cluster count by one in every iteration until it satisfies the validity of cluster quality.
The main purpose in [8] is to optimize the initial centroids for the k-Means algorithm. The author proposed a hierarchical k-Means algorithm that utilizes all the clustering results of k-Means over a certain number of runs, even though some of them reach local optima. It then transforms all the centroids of the clustering results by combining them with a hierarchical algorithm in order to determine the initial centroids for k-Means. This algorithm is better suited for complex clustering cases with large datasets and many dimensional attributes.
III. SILHOUETTE VALIDITY INDEX
The Silhouette value for each point is a measure of how similar that point is to the
points in its own cluster compared to the points in other clusters. This technique computes the
silhouette width for each data point, silhouette width for each cluster and overall average
silhouette width.
The silhouette width for the i-th point of the m-th cluster is given by equation 1:

$$ S_i(m) = \frac{b_i - a_i}{\max(b_i, a_i)} \qquad (1) $$
where $a_i$ is the average distance from the i-th point to the other points in its cluster, and $b_i$ is the minimum of the average distances from point i to the points in each of the other k-1 clusters. It ranges from -1 to +1. A silhouette index close to 1 for a point i indicates that it belongs to the cluster to which it was assigned. A value of zero indicates that the point could equally be assigned to the closest other cluster. A value close to -1 indicates that the point is wrongly clustered or lies somewhere between clusters.
The silhouette width for each cluster is given by equation 2:

$$ S(m) = \frac{1}{n(m)} \sum_{i=1}^{n(m)} S_i(m) \qquad (2) $$
The overall average silhouette width is given by equation 3:

$$ S = \frac{1}{k} \sum_{m=1}^{k} S(m) \qquad (3) $$
We have used this silhouette validity index as a measure of cluster validity in the implementation of the original k-Means, modified approach I and modified approach II. We have also used this measure as the basis for making the proposed algorithm work dynamically.
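Equations 1-3 can be sketched in Python as follows (an illustrative sketch only; the paper's implementation is in MATLAB, and the function names here are ours; `labels` is assumed to be a list of cluster ids, one per point):

```python
from math import dist  # Euclidean distance between two points (Python >= 3.8)

def silhouette_widths(X, labels):
    """Per-point silhouette width S_i (equation 1)."""
    clusters = sorted(set(labels))
    S = []
    for i, x in enumerate(X):
        own = labels[i]
        # a_i: average distance from x to the other points of its own cluster
        same = [dist(x, y) for j, y in enumerate(X) if labels[j] == own and j != i]
        a = sum(same) / len(same)
        # b_i: minimum over the other clusters of the average distance to that cluster
        b = min(
            sum(dist(x, y) for j, y in enumerate(X) if labels[j] == c) / labels.count(c)
            for c in clusters if c != own
        )
        S.append((b - a) / max(a, b))
    return S

def overall_silhouette(X, labels):
    """Average of the per-cluster average widths (equations 2 and 3)."""
    S = silhouette_widths(X, labels)
    clusters = sorted(set(labels))
    per_cluster = [sum(s for s, l in zip(S, labels) if l == c) / labels.count(c)
                   for c in clusters]
    return sum(per_cluster) / len(per_cluster)
```

An `overall_silhouette` value close to 1 indicates compact, well-separated clusters, matching the interpretation above.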
IV. ORIGINAL K-MEANS ALGORITHM
The k-Means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.
The k-means algorithm proceeds as follows:
1. Randomly select k of the objects, each of which initially represents a cluster mean or center.
2. For each of the remaining objects, assign the object to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. Then compute the new mean of each cluster using equation 4:

$$ M_j = \frac{1}{n_j} \sum_{\forall Z_p \in C_j} Z_p \qquad (4) $$

where $M_j$ is the centroid of cluster j and $n_j$ is the number of data points in cluster j.
3. This process iterates until the criterion function converges.
Typically the square-error criterion is used, defined by equation 5:

$$ E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \qquad (5) $$

where p is a data point and $m_i$ is the center of cluster $C_i$; E is the sum of squared errors over all points in the dataset. The distance in the criterion function is the Euclidean distance, which is used to calculate the distance between a data point and a cluster center.
The Euclidean distance between two vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) can be calculated using equation 6:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (6) $$
Algorithm: The k –Means algorithm for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster.
Input:
• k: the number of clusters,
• D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is most similar, based
on the mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;
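The method above can be sketched in Python as follows (an illustrative sketch only, not the paper's MATLAB implementation; function and variable names are ours):

```python
import random
from math import dist

def k_means(D, k, max_iter=100, seed=0):
    """Basic k-Means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = rng.sample(D, k)                       # step 1: arbitrary initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 3: (re)assign each object to the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for p in D:
            j = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[j].append(p)
        # step 4: recompute each center as the mean of its cluster (equation 4)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:                   # step 5: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters
```

Because the initial centers are sampled at random, different seeds can converge to different local optima, which is exactly the sensitivity the paper sets out to remove.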
V. MODIFIED APPROACH I
The first approach discussed in [1] optimizes the Original k –Means algorithm by
proposing a method on how to choose initial clusters. The author proposed a method that
partitions the given input data space into k * k segments, where k is desired number of
clusters. After portioning the data space, frequency of each segment is calculated and highest
k frequency segments are chosen to represent initial clusters. If some parts are having same
frequency, the adjacent segments with the same least frequency are merged until we get the k
number of segments. Then initial centers are calculated by taking the mean of the data points
in respective segments to get the initial k centers. By this process we will get the initial which
are always same as compared to the Original k – Means algorithm which always selects
initial random centers for a given dataset.
Next, a threshold distance is calculated for each centroid: the distance between each pair of cluster centroids is computed, and for each centroid half of the minimum distance to the remaining centroids is taken. The threshold distance is denoted dc(i) for the cluster C i.
To assign a data point to a cluster, take a point p in the dataset, calculate its distance from the centroid of cluster i, and compare it with dc(i). If it is less than or equal to dc(i), assign the data point p to cluster i; otherwise calculate its distance from the other centroids. This process is repeated until the data point p is assigned to one of the clusters. If the data point p is not assigned to any cluster, then the centroid that shows the minimum distance to the data point p becomes the centroid for that point. Each centroid is then updated by calculating the mean of the data points in its cluster.
Pseudo code for Modified k-Means algorithm is as follows:
Input: Dataset of N data points D (i = 1 to N)
Desired number of clusters = k
Output: N data points clustered into k clusters.
Steps:
1. Input the data set and value of k.
2. If the value of k is 1 then Exit.
3. Else
4. /* divide the data point space into k*k segments: k vertically and k horizontally */
5. For each dimension {
6. Calculate the minimum and maximum value of the data points.
7. Calculate the range of each group (RG) using equation 7:

$$ RG = \frac{(min + max)}{k} \qquad (7) $$
8. Divide the data point space into k groups with width RG
9. }
10. Calculate the frequency of data points in each partitioned space.
11. Choose the k highest-frequency groups.
12. Calculate the mean of each selected group. /* These will be the initial centroids of the k clusters. */
13. Calculate the distance between each pair of cluster centroids using equation 8:

$$ d(C_i, C_j) = \{ d(m_i, m_j) : (i, j) \in [1, k],\ i \neq j \} \qquad (8) $$

where $d(C_i, C_j)$ is the distance between centroids i and j.
14. Take the minimum distance for each cluster and halve it using equation 9:

$$ dc(i) = \frac{1}{2} \min_j \{ d(C_i, C_j) \} \qquad (9) $$

where dc(i) is half of the minimum distance of the i-th cluster from the remaining clusters.
15. For each data point Zp = 1 to N {
16. For each cluster j= 1 to k {
17. Calculate d(Zp, Mj) using equation 10:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (10) $$

where d(x, y) is the distance between vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).
18. If d(Zp, Mj) ≤ dc(j) {
19. Then assign Zp to cluster Cj.
20. Break;
21. }
22. Else
23. Continue;
24. }
25. If Zp does not belong to any cluster, then
26. assign Zp to the cluster C_i with minimum d(Zp, M_i), where i ∈ [1, k]
27. }
28. Check the termination condition of the algorithm; if satisfied
29. Exit.
30. Else
31. Calculate the centroid of each cluster using equation 11:

$$ M_j = \frac{1}{n_j} \sum_{\forall Z_p \in C_j} Z_p \qquad (11) $$

where $M_j$ is the centroid of cluster j and $n_j$ is the number of data points in cluster j.
32. Go to step 13.
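The initial-center selection of steps 4-12 can be sketched in Python as follows. This is an illustrative sketch only: the merging of equal-frequency segments is omitted, and the cell width uses the conventional (max - min)/k rather than the (min + max)/k printed in equation 7; all names are ours.

```python
from collections import Counter

def initial_centers(points, k):
    """Grid-based initial centers: split each dimension into k bands,
    pick the k most populated cells, and return each cell's mean point.
    Simplified sketch; cell width is taken as (max - min) / k."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    width = [(max(p[d] for p in points) - lo[d]) / k or 1.0 for d in range(dims)]

    def cell(p):
        # index of the grid cell a point falls in (clamped to the last band)
        return tuple(min(int((p[d] - lo[d]) / width[d]), k - 1) for d in range(dims))

    freq = Counter(cell(p) for p in points)
    centers = []
    for c, _ in freq.most_common(k):                 # k highest-frequency cells
        members = [p for p in points if cell(p) == c]
        centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return centers
```

Because the grid and the frequencies are deterministic, the same dataset always yields the same initial centers, unlike the random selection of the original k-Means.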
VI. MODIFIED APPROACH II
In the work of [2], the author calculates the distance between each pair of data points and selects the pair with the minimum distance, removing it from the actual dataset. Then a data point is taken from the dataset, its distance to the selected initial points is calculated, and the point with the minimum distance is added to the initial set. This process is repeated until a threshold value is reached. If the number of subsets formed is less than k, the distances between the remaining data points are calculated again and the process is repeated until k subsets are formed.
The first phase is to determine the initial centroids. For this, compute the distance between each data point and all other data points in the set D. Then find the closest pair of data points and form a set A1 consisting of these two data points, deleting them from the data point set D. Then determine the data point closest to the set A1, add it to A1 and delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. Then form another data-point set A2 in the same way. Repeat this until 'k' such sets of data points are obtained. Finally, the initial centroids are obtained by averaging all the vectors in each data-point set. The Euclidean distance is used for determining the closeness of each data point to the cluster centroids.
The next phase is to assign points to the clusters. The main idea here is to keep two simple data structures that retain, for every data object, the label of its cluster and its distance to the nearest cluster center during each iteration, so that they can be used in the next iteration. We calculate the distance between the current data object and the new cluster center; if the computed distance is smaller than or equal to the distance to the old center, the data object stays in the cluster it was assigned to in the previous iteration. Therefore there is no need to calculate the distances from this data object to the other k-1 cluster centers, saving that calculation time. Otherwise, we must calculate the distance from the current data object to all k cluster centers, find the nearest cluster center, and assign the point to it; we then record the label of the nearest cluster center and the distance to it. Because in each iteration some data points remain in their original clusters, part of the distance computation is avoided, saving total calculation time and thereby enhancing the efficiency of the algorithm.
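The caching idea of this second phase can be sketched in Python as follows (an illustrative sketch of the idea only; `centers` would come from the first phase, and the names are ours, not the paper's):

```python
from math import dist

def assign_with_cache(points, centers, max_iter=100):
    """k-Means assignment that caches each point's nearest-center distance.
    If a point is still within its cached distance of its (moved) center,
    the scan over the other k-1 centers is skipped."""
    k = len(centers)
    # initial full assignment: ClusterId and Nearest_Dist of the pseudocode
    cluster_id = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
    nearest = [dist(p, centers[cluster_id[i]]) for i, p in enumerate(points)]
    for _ in range(max_iter):
        # recalculate the centroids
        new_centers = []
        for j in range(k):
            members = [p for p, c in zip(points, cluster_id) if c == j]
            new_centers.append(
                tuple(sum(v) / len(members) for v in zip(*members)) if members else centers[j]
            )
        if new_centers == centers:              # convergence: no center updates
            break
        centers = new_centers
        for i, p in enumerate(points):
            d_own = dist(p, centers[cluster_id[i]])
            if d_own <= nearest[i]:
                nearest[i] = d_own              # stays in its cluster, no full scan
            else:
                j = min(range(k), key=lambda j: dist(p, centers[j]))
                cluster_id[i], nearest[i] = j, dist(p, centers[j])
    return cluster_id, centers
```

The `d_own <= nearest[i]` test is what saves the distance computations to the other k-1 centers for points that do not move.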
Pseudo code for the modified k-Means algorithm is as follows:
Input: Dataset D of N data points (i = 1 to N)
Desired number of clusters = k
Output: N data points clustered into k clusters.
Phase 1:
Steps:
1. Set m = 1;
2. Compute the distance between each data point and all other data points in the set D using equation 12:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (12) $$

where d(x, y) is the distance between vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).
3. Find the closest pair of data points from the set D and form a data-point set Am (1 <= m <= k) which contains these two data points; delete these two data points from the set D;
4. Find the data point in D that is closest to the data-point set Am, add it to Am and delete it from D;
5. Repeat step 4 until the number of data points in Am reaches 0.75*(N/k);
6. If m < k, then m = m + 1; find another pair of data points from D between which the distance is the shortest, form another data-point set Am and delete them from D; go to step 4;
7. For each data-point set Am (1 <= m <= k) find the arithmetic mean of the vectors of data points in Am; these means will be the initial centroids.
Phase 2:
Steps:
1. Compute the distance of each data point di (1 <= i <= N) to all the centroids Cj (1 <= j <= k) as d(di, Cj) using equation 12;
2. For each data point di, find the closest centroid Cj and assign di to cluster j.
3. Set ClusterId[i] = j; /* j: Id of the closest cluster for point i */
4. Set Nearest_Dist[i] = d(di, Cj);
5. For each cluster j (1 <= j <= k), recalculate the centroids;
6. Repeat
7. For each data point di,
a. Compute its distance from the centroid of the present nearest cluster;
b. If this distance is less than or equal to the present nearest distance, the data point stays in the cluster;
c. Else for every centroid cj (1 <= j <= k) compute the distance d(di, Cj);
d. End for;
8. Assign the data point di to the cluster with the nearest centroid Cj;
9. Set ClusterId[i] = j;
10. Set Nearest_Dist[i] = d(di, Cj);
11. End for (step (7));
12. For each cluster j (1 <= j <= k), recalculate the centroids until the convergence criteria is met, i.e. either no center updates or no point moves to another cluster.
VII. PROPOSED APPROACH
The changes are based on the selection of initial k centers and making the algorithm to
work dynamically, i.e. instead of running algorithms for different values of k, we try to make
algorithm in such a way that it itself decides how many clusters are there in a given dataset.
The two modifications are as follows:
• A method to select initial k centers.
• To make algorithm dynamic.
Proposed Algorithm:
The algorithm consists of three phases:
The first phase determines the initial k centroids. In this phase the user inputs the dataset and the value of k. The data space is divided into k*k segments as discussed in [1]. After dividing the data space we choose the k segments with the highest frequency of points. If some segments have the same frequency, adjacent segments with the same least frequency are merged until we get k segments.
We then find the distance of each point in each selected segment from the origin; these distances are sorted for each segment, and the point with the middle distance is selected as the center for that segment. This step is repeated for each of the k selected segments. These points represent the initial k centroids.
The second phase assigns points to clusters based on the minimum distance between each point and the cluster centers. The distance measure used is the Euclidean distance. The mean of each cluster formed is then computed as the next center. This process is repeated until there are no more center updates.
The third phase is where the algorithm iterates dynamically to determine the right number of clusters, using the concept of the silhouette validity index to choose it.
Pseudo Code for Proposed Algorithm:
Input: Dataset of N points.
Desired number of k clusters.
Output: N points grouped into k clusters.
Phase1: Finding Initial centroids
Steps:
1. Input the dataset and value of k ≥ 2.
2. Divide the data point set into k*k segments /*k vertically and k horizontally*/
3. For each dimension
{
4. Calculate the minimum and maximum value of data points.
5. Calculate the width (Rg) using equation 13:

$$ Rg = \frac{min + max}{k} \qquad (13) $$
}
6. Calculate the frequency of data points in each segment.
7. Choose the k highest frequency segments.
8. For each segment i = 1 to k
{
9. For each point j in the segment i
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-
6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME
213
{
10. Calculate the distance of point j from the origin
}
11. Sort these distances in ascending order in matrix D
12. Select the middle distance value.
13. The co-ordinates corresponding to the distance selected in step 12 are chosen as the initial center for the i-th cluster.
}
14. These k co-ordinates are stored in matrix C which represents the initial
centroids.
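Phase 1's mid-distance rule can be sketched in Python as follows. This is an illustrative sketch only: the merging of equal-frequency segments is omitted, the cell width uses the conventional (max - min)/k rather than the (min + max)/k printed in equation 13, and all names are ours.

```python
from collections import Counter
from math import dist

def phase1_centers(points, k):
    """In each of the k most populated grid cells, pick the point whose
    distance from the origin is the middle value for that cell."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    width = [(max(p[d] for p in points) - lo[d]) / k or 1.0 for d in range(dims)]

    def cell(p):
        # index of the grid cell a point falls in (clamped to the last band)
        return tuple(min(int((p[d] - lo[d]) / width[d]), k - 1) for d in range(dims))

    freq = Counter(cell(p) for p in points)
    origin = (0.0,) * dims
    centers = []
    for c, _ in freq.most_common(k):                 # k highest-frequency cells
        members = sorted((p for p in points if cell(p) == c),
                         key=lambda p: dist(p, origin))
        centers.append(members[len(members) // 2])   # point at the middle distance
    return centers
```

Unlike the mean-based rule of modified approach I, the returned centers are actual data points, chosen by the median of the origin distances within each cell.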
Phase2: Assigning points to the cluster
Steps:
1. Repeat
2. For each data point p = 1 to N
{
3. For each cluster j = 1 to k
{
4. Calculate the distance between point p and the cluster centroid cj of Cj using equation 14:

$$ d(p, c_j) = \sqrt{(p - c_j)^2} \qquad (14) $$
}
}
5. Assign p to min{d(p, cj)} where j ∈ [1, k].
6. Check the termination condition of the algorithm; if satisfied
7. Exit
8. Else
9. Calculate the new centroids of the clusters using equation 15:

$$ c_j = \frac{1}{n_j} \sum_{\forall p \in C_j} p \qquad (15) $$

where $n_j$ is the number of points in cluster $C_j$.
10. Go to step 1.
Phase3: To determine appropriate number of clusters
For the given value of k, phases 1 and 2 are run for three iterations using k-2, k and k+2. The three corresponding silhouette values are calculated as discussed in Section 3. These are denoted by Sk-2, Sk, Sk+2. The appropriate number of clusters is then found using the following steps.
Steps:
1. If Sk-2 < Sk and Sk > Sk+2, then run phase 1 and phase 2 using k+1 and k-1 and find the corresponding Sk+1 and Sk-1. The maximum of the three values Sk-1, Sk, Sk+1 then determines the appropriate number of clusters. For example, if Sk+1 is the maximum, then the number of clusters formed by the algorithm is k+1.
2. Else if Sk+2 > Sk and Sk+2 > Sk-2, then run phase 1 and phase 2 using k+1, k+3 and k+4 and find the corresponding Sk+1, Sk+3 and Sk+4. The value of k corresponding to the maximum of Sk+1, Sk+2, Sk+3, Sk+4 is returned.
3. Else if Sk+2 < Sk-2 and Sk < Sk-2, then run phase 1 and phase 2 using k-1, k-2, k-3 and k-4 and find the corresponding Sk-1, Sk-2, Sk-3 and Sk-4. The value of k corresponding to the maximum of Sk-1, Sk-2, Sk-3, Sk-4 is returned.
4. Stop.
Thus the algorithm terminates itself when the best value of k is found. This value of k shows the appropriate number of clusters for the given dataset.
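The branching of Phase 3 can be sketched in Python as follows, assuming a helper `cluster_and_score(data, k)` that runs phases 1 and 2 and returns the overall silhouette value. The helper is hypothetical; only the branching mirrors the steps above.

```python
def best_k(data, k0, cluster_and_score):
    """Dynamic search of Phase 3: probe k0-2, k0, k0+2, then refine
    around whichever silhouette value is largest."""
    s = {k: cluster_and_score(data, k) for k in (k0 - 2, k0, k0 + 2)}
    if s[k0] > s[k0 - 2] and s[k0] > s[k0 + 2]:          # step 1: peak near k0
        candidates = [k0 - 1, k0, k0 + 1]
    elif s[k0 + 2] > s[k0] and s[k0 + 2] > s[k0 - 2]:    # step 2: peak to the right
        candidates = [k0 + 1, k0 + 2, k0 + 3, k0 + 4]
    else:                                                # step 3: peak to the left
        candidates = [k0 - 4, k0 - 3, k0 - 2, k0 - 1]
    for k in candidates:
        s.setdefault(k, cluster_and_score(data, k))      # reuse values already probed
    return max(candidates, key=lambda k: s[k])           # step 4: stop at the best k
```

Only a handful of k values are ever clustered, which is why the dynamic run is cheaper than sweeping all values of k by hand.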
VIII. RESULT ANALYSIS
The proposed algorithm is implemented and the results are compared with those of the modified approaches [1] and [2] in terms of execution time and the initial centers chosen.
1. The total time taken by the algorithm to form clusters and dynamically determine the appropriate number of clusters is less than the total time taken by the algorithm of [1] when run for different values of k. For example, if we run the algorithm of [1] for k = 2, 3, 4, 5, 6, 7, etc., it takes more time than the proposed algorithm, which itself runs over different values of k.
2. We define a new method of determining the initial centers that is based on the middle value rather than the mean value. The reason is that the middle value better represents the distribution: whereas the mean is influenced by very large and very small values, the middle value is not affected by them.
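This robustness claim can be seen with a small example (arbitrary numbers, not taken from the paper's datasets):

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2]   # middle value (upper middle for even lengths)

values = [10, 11, 12, 13, 14]
assert mean(values) == 12 and median(values) == 12
values_with_outlier = [10, 11, 12, 13, 1000]   # one extreme value
assert mean(values_with_outlier) == 209.2      # the mean is dragged away
assert median(values_with_outlier) == 12       # the middle value is unaffected
```

A single extreme point moves the mean far from the bulk of the data, while the middle value stays put, which is why it is preferred here for center selection.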
The results show that the algorithm works dynamically and is also an improvement over the original k-Means. Table I shows the results of running the algorithm of [1] over the wine dataset from the UCI repository for k = 3, 4, 5, 6, 7 and 9. The algorithm is run for these values of k because the proposed algorithm was initially fed k = 7 and runs over these values of k automatically, so the total execution times of both algorithms can be compared. The results show that the proposed algorithm takes less time than running the algorithm of [1] individually for different values of k.
TABLE I: RESULTS OF ALGORITHM [1] FOR WINE DATASET
Sr. no. | Value of k | Silhouette validity index | Execution time (s)
1. | 3 | 0.4437 | 3.92
2. | 4 | 0.3691 | 4.98
3. | 5 | 0.3223 | 2.89
4. | 6 | 0.2923 | 7.51
5. | 7 | 0.2712 | 3.56
6. | 9 | 0.2082 | 11.96
Total execution time: 34.82
The run of the proposed algorithm stops at:
maximum silhouette value = 0.443641 for the best value of k = 3
Elapsed time is 30.551549 seconds.
Thus, from the results it is clear that the algorithm stops itself when it finds the right number of clusters, and that the proposed algorithm takes less time than the algorithm of [1]. Table II shows the execution times of both algorithms. It also shows the initial value fed to our algorithm and the best value of k at which the algorithm stops. Experiments are performed on random datasets of 50, 178 and 500 points.
TABLE II: COMPARING RESULTS OF PROPOSED ALGORITHM & ALGORITHM IN [1]
Sr. no. | Dataset | Initial value of k | Best value of k | Proposed algorithm time (s) | Algorithm [1] time (s)
1. | 50 points | 4 | 6 | 18.6084 | 28.0096
2. | 178 points | 9 | 10 | 39.1941 | 50.2726
3. | 500 points | 5 | 4 | 66.6134 | 91.9400
Table III shows comparison results between the original k-Means, modified approach II and the proposed algorithm. Comparing execution times, the proposed algorithm takes much less time than the original k-Means on the larger datasets, and for the large dataset of 500 points the proposed algorithm also outperforms modified approach II.
TABLE III: EXECUTION TIME (s) COMPARISON
Sr. no. | Dataset | Original k-Means | Modified approach II | Proposed algorithm
1. | 50 points | 15.1727 | 11.4979 | 18.6084
2. | 178 points | 74.5168 | 21.6497 | 39.1941
3. | 500 points | 86.7619 | 87.2461 | 66.6134
From all the results we can conclude that although the procedure of the proposed algorithm is
longer, it spares the user from running the algorithm for different values of k as in the other
three algorithms discussed in the previous sections. The proposed algorithm dynamically
iterates and stops at the best value of k.
Figures 1–3 show the silhouette plots for the three random-point datasets discussed above,
depicting how close each point is to the other members of its own cluster. The plots also show
that a point is placed in an incorrect cluster if the silhouette index value for that point is
negative. Figure 4 shows an execution time comparison of all the algorithms discussed in the
paper.
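The property just mentioned — a negative silhouette value flags a point that sits closer to another cluster than to its own — can be checked with a small pure-Python sketch. The helper name and the toy data below are my own illustrative assumptions, not taken from the paper:

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette s(i) = (b_i - a_i) / max(a_i, b_i), where
    a_i is the mean distance to the point's own cluster and b_i the
    smallest mean distance to any other cluster."""
    clusters = set(labels)
    values = []
    for i, p in enumerate(points):
        own = labels[i]
        same = [q for q, l in zip(points, labels) if l == own]
        # a_i: mean distance to the *other* members of the same cluster
        # (the sum includes dist(p, p) = 0, hence division by len - 1)
        a = (sum(math.dist(p, q) for q in same) / (len(same) - 1)
             if len(same) > 1 else 0.0)
        # b_i: smallest mean distance to any other cluster
        b = min(sum(math.dist(p, q) for q, l in zip(points, labels) if l == c)
                / labels.count(c)
                for c in clusters if c != own)
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

# Two tight blobs; the last point (1, 1) belongs with the first blob
# but is deliberately mislabeled into the second cluster.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (1, 1)]
lab = [0, 0, 0, 1, 1, 1]
s = silhouette_values(pts, lab)
```

Here the mislabeled point receives a negative silhouette value, while every correctly placed point receives a positive one.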
Figure 1 Silhouette plot for 50 points
Figure 2 Silhouette plot for 178 points
Figure 3 Silhouette plot for 500 points
Figure 4 Execution time (s) comparison of different algorithms
Table IV compares all four algorithms discussed in the paper on the basis of the
experimental results.
TABLE IV: COMPARISON OF ALGORITHMS
Parameter | Original k-Means | Modified Approach I | Modified Approach II | Proposed Algorithm
Initial centers | Always random, so different clusters for the same value of k on the given data. | Initial center selection is fixed by always choosing the centers in the highest-frequency segment. | Initial centers are chosen based on the similarity between points. | Initial centers are fixed by choosing, in the highest-frequency segment, the middle point of that segment's points calculated from the origin.
Redundant data | Can work with data redundancies. | Suitable. | Not suitable for data with redundant points. | Can work with redundant data.
Dead unit problem | Yes. | No. | No. | No.
Value of k | Fixed input parameter. | Fixed input parameter. | Fixed input parameter. | Initial value given as input; the algorithm dynamically iterates and determines the best value of k for the given data.
Execution time | Highest of the four. | Less than original k-Means, but more than the other two algorithms. | Less than original k-Means and Modified Approach I, but more than the proposed algorithm. | Least of all four algorithms.
IX. CONCLUSION
In this paper we presented different approaches to k-Means clustering, comparing them
and stressing their pros and cons. Another issue discussed in the paper is clustering validity,
which we measured using the silhouette validity index; for a given dataset this index shows
which value of k produces compact and well-separated clusters. The paper presents a new
method for selecting initial centers and a dynamic approach to k-Means clustering, so that the
user need not check the clusters for different values of k: instead, the user inputs an initial
value of k and the algorithm stops once it finds the best value of k, i.e. when it attains the
maximum silhouette value. Experiments also show that the proposed dynamic algorithm takes
much less computation time than the other three algorithms discussed in the paper.
X. ACKNOWLEDGEMENTS
I am eternally grateful to my research supervisor Dr. M.P.S Bhatia for his invigorating
support, valuable suggestions and guidance. I thank him for his supervision and for patiently
correcting my work; it has been a great experience and I have gained a lot here. I am thankful
to the almighty God who gave me the power, good sense and confidence to complete my
research analysis successfully. I also thank my parents and my friends, who were a constant
source of encouragement, and Navneet Singh for his appreciation.
REFERENCES
[1] Ran Vijay Singh and M.P.S Bhatia, "Data Clustering with Modified K-means Algorithm", Proc. IEEE International Conference on Recent Trends in Information Technology (ICRTIT 2011), pp. 717-721.
[2] D. Napoleon and P. Ganga Lakshmi, "An Efficient K-Means Clustering Algorithm for Reducing Time Complexity using Uniform Distribution Data Points", IEEE, 2010.
[3] Tajunisha and Saravanan, "Performance Analysis of k-means with Different Initialization Methods for High Dimensional Data", International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 1, No. 4, October 2010.
[4] Neha Aggarwal and Kriti Aggarwal, "A Mid-point Based k-means Clustering Algorithm for Data Mining", International Journal on Computer Science and Engineering (IJCSE), 2012.
[5] Barileé Barisi Baridam, "More Work on k-means Clustering Algorithm: The Dimensionality Problem", International Journal of Computer Applications (0975-8887), Vol. 44, No. 2, April 2012.
[6] Shi Na, Li Xumin and Guan Yong, "Research on K-means Clustering Algorithm", Proc. Third International Symposium on Intelligent Information Technology and Security Informatics, IEEE, 2010.
[7] Ahamad Shafeeq and Hareesha, "Dynamic Clustering of Data with Modified K-mean Algorithm", Proc. International Conference on Information and Computer Networks (ICICN 2012), IPCSIT Vol. 27, IACSIT Press, Singapore, 2012.
[8] Kohei Arai and Ali Ridho Barakbah, "Hierarchical K-means: An Algorithm for Centroids Initialization for k-Means", Reports of the Faculty of Science and Engineering, Saga University, Vol. 26, No. 1, 2007.
[9] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (Second Edition).
 
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
 
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
 
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
 
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENTA MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
 

Dernier

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Dernier (20)

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

DBMS gave access to the data stored, but this was only a small part of what could be gained from the data.
Analyzing data can provide further knowledge about a business by going beyond what is explicitly stored to derive knowledge about the business. The value of the information that can be learned from data has made clustering techniques widely applied in artificial intelligence, customer-relationship management, data compression, data mining, image processing, machine learning, pattern recognition, market analysis, fraud detection, and so on.

Cluster analysis of data is an important task in knowledge discovery and data mining. Clustering is the process of grouping data on the basis of similarities and dissimilarities among the data elements: it finds groups of objects such that objects in one group are similar to one another and different from the objects in other groups. Clustering is an unsupervised algorithm that requires a parameter specifying the number of clusters k. Setting this parameter requires either detailed knowledge of the dataset or running the algorithm for different values of k to determine the correct number of clusters. For large and multidimensional data, however, clustering becomes time consuming and determining the correct number of clusters becomes difficult.

The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. It has also drawn criticism for its performance, particularly for demanding the value of k up front. Previous research shows that providing the number of clusters in advance does not in any way guarantee the production of good-quality clusters. The original k-Means also determines the initial centers randomly in each run, which leads to different solutions.
To validate the clustering results we have chosen the Silhouette validity index as a validity measure. The Silhouette index is particularly useful when seeking to know whether the clusters produced are compact and well separated; it is applied after clustering to check the validity of the clusters produced.

This paper presents a new method for selecting the initial k centers and a dynamic approach to k-Means clustering. An initial value of k, ko, is provided by the user. The algorithm partitions the whole space into different segments and calculates the frequency of data points in each segment. The ko highest-frequency segments are chosen as the initial ko clusters. To determine the initial centers, the algorithm calculates, for each segment, the distance of its points from the origin, sorts these distances, and takes the coordinates corresponding to the mid value of the sorted distances as the center for that segment. Cluster assignment is then performed, and the Silhouette validity index is calculated for the initial ko clusters. This step is repeated for (ko + 2) and (ko - 2) clusters. The algorithm then iterates under the specified conditions and stops at the maximum value of the silhouette index, yielding the correct number of clusters k. The proposed approach is dynamic in the sense that the user need not run the algorithm for different values of k; instead the algorithm stops itself at the best value of k, giving compact and well-separated clusters. The proposed algorithm takes less execution time than the original k-Means and the modified approaches to k-Means clustering.

The paper is organised as follows: Section 2 presents related work. Section 3 discusses the Silhouette validity index. Section 4 describes the original k-Means. Sections 5 and 6 detail the approaches discussed in [1] and [2] respectively. Section 7 describes the proposed algorithm. Section 8 shows implementation results. Conclusions and future work are presented in Section 9.
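The silhouette-driven search over k described above can be sketched as follows. This is an illustrative Python sketch, not the authors' MATLAB R2009b implementation: the inner k-Means is a plain Lloyd-style routine with random seeding rather than the paper's segment-based center selection, the candidate range is a small neighbourhood of ko rather than the paper's exact iteration scheme, and all function names are my own. The silhouette is computed as in equations 1-3, with singleton clusters scored 0 by convention.

```python
import math
import random

def dist(x, y):
    # Euclidean distance (equation 6 in the paper).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(pts):
    return tuple(sum(p[d] for p in pts) / len(pts) for d in range(len(pts[0])))

def kmeans(points, k, iters=50):
    # Plain Lloyd-style k-Means with random seeding (NOT the paper's
    # segment-based initialization); just enough to drive the k search.
    random.seed(0)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist(p, centers[c]))].append(p)
        centers = [mean(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
    return clusters

def silhouette(clusters):
    # Overall average silhouette width (equations 1-3); singleton
    # clusters score 0 by convention, empty clusters are skipped.
    clusters = [cl for cl in clusters if cl]
    widths = []
    for m, cl in enumerate(clusters):
        if len(cl) == 1:
            widths.append(0.0)
            continue
        s = 0.0
        for p in cl:
            a = sum(dist(p, q) for q in cl if q is not p) / (len(cl) - 1)
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for j, other in enumerate(clusters) if j != m)
            s += (b - a) / max(a, b)
        widths.append(s / len(cl))
    return sum(widths) / len(clusters)

def dynamic_kmeans(points, k0):
    # Search a small neighbourhood of k0 and keep the k whose clustering
    # maximizes the overall silhouette index.
    candidates = range(max(2, k0 - 2), k0 + 3)
    return max(candidates, key=lambda k: silhouette(kmeans(points, k)))

# Two well-separated blobs; starting from k0 = 3 the search settles on k = 2.
pts = [(x, y) for x in (0.0, 0.1, 0.2) for y in (0.0, 0.1)]
pts += [(x, y) for x in (5.0, 5.1, 5.2) for y in (5.0, 5.1)]
```

The point of the sketch is only the outer loop: clustering quality is scored by the silhouette index, so the best k is found without the user re-running the algorithm by hand.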
II. RELATED WORK

In literature [1] there is an improved k-Means algorithm based on reducing the sensitivity to the initial centers. This algorithm partitions the whole data space into different segments and calculates the frequency of points in each segment. The segments showing the maximum frequency will be considered for the initial centroids, depending upon the value of k.

In literature [2] another method of finding initial cluster centers is discussed. It first finds the closest pair of data points and then, on the basis of these points, forms a subset of the dataset; this process is repeated k times to find k small subsets and thus the initial k centroids.

The author in literature [3] uses Principal Component Analysis for dimension reduction and to find initial cluster centers. In [4] the data set is first pre-processed by transforming all data values to positive space; the data is then sorted and divided into k equal sets, and the middle value of each set is taken as an initial center point.

In literature [5] a dynamic solution to k-Means is proposed in which the algorithm is designed with a pre-processor using the silhouette validity index that automatically determines the appropriate number of clusters, which increases the efficiency of clustering to a great extent. In [6] a method is proposed to make the algorithm independent of the number of iterations; it avoids computing the distance of each data point to the cluster centers repeatedly, saving running time and reducing computational complexity.

In literature [7] a dynamic means algorithm is proposed to improve cluster quality and optimize the number of clusters. The user has the flexibility either to fix the number of clusters or to input the minimum number of clusters required. In the former case it works the same as the k-Means algorithm.
In the latter case the algorithm computes new cluster centers by incrementing the cluster count by one in every iteration until it satisfies the validity of cluster quality.

In [8] the main purpose is to optimize the initial centroids for the k-Means algorithm. The author proposed a hierarchical k-Means algorithm. It utilizes all the clustering results of k-Means over a number of runs, even though some of them reach local optima, and then transforms all the centroids of the clustering results by combining them with a hierarchical algorithm in order to determine the initial centroids for k-Means. This algorithm is better suited to complex clustering cases with large data sets and many dimensional attributes.

III. SILHOUETTE VALIDITY INDEX

The silhouette value for each point is a measure of how similar that point is to the points in its own cluster compared to the points in other clusters. This technique computes the silhouette width for each data point, the silhouette width for each cluster, and the overall average silhouette width. The silhouette width for the i-th point of the m-th cluster is given by equation 1:

S_i(m) = (b_i - a_i) / max(b_i, a_i)    (1)

where a_i is the average distance from the i-th point to the other points in its cluster and b_i is the minimum of the average distances from point i to the points in each of the other k - 1 clusters. It
ranges from -1 to +1. A point i with a silhouette index close to 1 belongs well to the cluster it has been assigned to. A value of zero indicates the object could equally be assigned to another, closest cluster. A value close to -1 indicates that the object is wrongly clustered or lies somewhere between clusters. The silhouette width for each cluster is given by equation 2:

S(m) = (1 / n(m)) * sum_{i=1..n(m)} S_i(m)    (2)

The overall average silhouette width is given by equation 3:

S = (1 / k) * sum_{m=1..k} S(m)    (3)

We have used this silhouette validity index as a measure of cluster validity in the implementations of the original k-Means, modified approach I, and modified approach II, and as the basis for making the proposed algorithm work dynamically.

IV. ORIGINAL k-MEANS ALGORITHM

The k-Means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity. The k-Means algorithm proceeds as follows:

1. Randomly select k of the objects, each of which initially represents a cluster mean or center.
2. For each of the remaining objects, assign the object to the cluster to which it is most similar, based on the distance between the object and the cluster mean.
3. Compute the new mean for each cluster using equation 4:

M_j = (1 / n_j) * sum_{Z_p in C_j} Z_p    (4)

where M_j is the centroid of cluster j and n_j is the number of data points in cluster j.
4. This process iterates until the criterion function converges.
Typically the square-error criterion is used, defined by equation 5:

E = sum_{i=1..k} sum_{p in C_i} |p - m_i|^2    (5)

where p is a data point and m_i is the center of cluster C_i; E is the sum of squared error over all points in the dataset. The distance in the criterion function is the Euclidean distance, which is used to calculate the distance between a data point and a cluster center.
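The iteration just described, i.e. assign to the nearest mean (equation 6), recompute the means (equation 4), and stop when nothing changes, can be sketched in Python as follows. This is an illustrative sketch of the standard algorithm, not the paper's MATLAB code; function names and the toy data are my own.

```python
import math
import random

def euclid(x, y):
    # Equation 6: Euclidean distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, iters=100, seed=1):
    random.seed(seed)
    centers = random.sample(points, k)          # step 1: random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 2: assign to nearest mean
            clusters[min(range(k), key=lambda j: euclid(p, centers[j]))].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]  # equation 4: recompute means
        if new == centers:                      # stop when no center moves
            break
        centers = new
    return centers, clusters

def sse(centers, clusters):
    # Equation 5: sum of squared error over all points.
    return sum(euclid(p, centers[j]) ** 2
               for j, cl in enumerate(clusters) for p in cl)

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, 2)
```

Because the seeding is random, different seeds can converge to different partitions and different E values; this is exactly the sensitivity to initialization that the modified approaches below try to remove.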
The Euclidean distance between two vectors x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) is calculated using equation 6:

d(x, y) = sqrt( sum_{i=1..n} (x_i - y_i)^2 )    (6)

Algorithm: the k-Means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.

Input:
• k: the number of clusters,
• D: a data set containing n objects.

Output: a set of k clusters.

Method:
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;

V. MODIFIED APPROACH I

The first approach, discussed in [1], optimizes the original k-Means algorithm by proposing a method for choosing the initial clusters. The author proposed a method that partitions the given input data space into k * k segments, where k is the desired number of clusters. After partitioning the data space, the frequency of each segment is calculated and the k highest-frequency segments are chosen to represent the initial clusters. If some segments have the same frequency, adjacent segments with the same least frequency are merged until k segments are obtained. The initial centers are then calculated by taking the mean of the data points in the respective segments. By this process the initial centers are always the same for a given dataset, in contrast to the original k-Means algorithm, which always selects random initial centers.
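The segment-based seeding just described can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the exact cell-indexing rule is an assumption (the paper only gives the group width RG of equation 7), and the merging of equal-frequency segments is omitted for brevity.

```python
from collections import Counter

def grid_initial_centers(points, k):
    # Partition each dimension into k groups of width RG = (min + max) / k
    # (equation 7), count points per grid cell, and take the mean of the
    # k most populated cells as the initial centers.
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    rg = [(lo[d] + max(p[d] for p in points)) / k for d in range(dims)]

    def cell(p):
        # Index of the segment a point falls into, per dimension (assumed rule).
        return tuple(min(int((p[d] - lo[d]) / rg[d]), k - 1) for d in range(dims))

    freq = Counter(cell(p) for p in points)
    centers = []
    for c, _ in freq.most_common(k):            # k highest-frequency segments
        members = [p for p in points if cell(p) == c]
        centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return centers

pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2), (4.0, 4.0), (4.1, 4.2), (4.2, 4.0)]
```

Unlike random seeding, this procedure is deterministic: the same dataset always yields the same initial centers.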
Next, a threshold distance is calculated for each centroid, defined as half of the minimum distance from that centroid to the remaining centroids; it is denoted dc(i) for cluster C_i. To assign a data point to a cluster, take a point p in the dataset, calculate its distance from the centroid of cluster i, and compare it with dc(i). If it is less than or equal to dc(i), assign the data point p to cluster i; otherwise calculate its distance from the other centroids. This process is repeated until data point p is assigned to one of the clusters. If data point p is not assigned to any cluster, the centroid showing the minimum distance to p becomes the centroid for that point. Each centroid is then updated by calculating the mean of the data points in the cluster.
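The threshold-based assignment rule just described can be sketched as follows; an illustrative Python sketch with my own names, assuming the centroids are already given.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def threshold_assign(points, centers):
    # dc(i): half the minimum distance from centroid i to the other
    # centroids (equation 9). A point is accepted by the first centroid
    # whose threshold it falls within; otherwise it goes to the globally
    # nearest centroid.
    k = len(centers)
    dc = [0.5 * min(euclid(centers[i], centers[j])
                    for j in range(k) if j != i) for i in range(k)]
    labels = []
    for p in points:
        label = None
        for i in range(k):
            if euclid(p, centers[i]) <= dc[i]:   # early accept, skip the rest
                label = i
                break
        if label is None:                        # fall back to nearest centroid
            label = min(range(k), key=lambda i: euclid(p, centers[i]))
        labels.append(label)
    return labels

centers = [(0.0, 0.0), (10.0, 0.0)]
pts = [(1.0, 0.0), (9.0, 0.5), (4.0, 0.0)]
```

The early-accept test is the point of the threshold: a point that falls within dc(i) of centroid i cannot be closer to any other centroid, so the remaining distance computations can be skipped.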
Pseudo code for the modified k-Means algorithm is as follows:

Input: Dataset D of N data points (i = 1 to N); desired number of clusters = k
Output: N data points clustered into k clusters.

Steps:
1. Input the data set and the value of k.
2. If the value of k is 1 then Exit.
3. Else
4. /* divide the data point space into k * k segments: k vertically and k horizontally */
5. For each dimension {
6. Calculate the minimum and maximum value of the data points.
7. Calculate the range of a group (RG) using equation 7:

RG = (min + max) / k    (7)

8. Divide the data point space into k groups of width RG.
9. }
10. Calculate the frequency of data points in each partitioned space.
11. Choose the k highest-frequency groups.
12. Calculate the mean of each selected group. /* These will be the initial centroids of the k clusters. */
13. Calculate the distance between the cluster centroids using equation 8:

d(C_i, C_j) = { d(m_i, m_j) : (i, j) in [1, k] and i != j }    (8)

where d(C_i, C_j) is the distance between centroids i and j.
14. Take the minimum distance for each cluster and halve it, using equation 9:

dc(i) = (1/2) * min[ d(C_i, C_j), ... ]    (9)

where dc(i) is half of the minimum distance of the i-th cluster from the other remaining clusters.
15. For each data point Zp = 1 to N {
16. For each cluster j = 1 to k {
17. Calculate d(Zp, Mj) using equation 10:

d(x, y) = sqrt( sum_{i=1..n} (x_i - y_i)^2 )    (10)

where d(x, y) is the distance between vectors x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n).
18. If d(Zp, Mj) <= dc(j) {
19. Then assign Zp to cluster C_j.
20. Break;
21. }
22. Else
23. Continue;
24. }
25. If Zp does not belong to any cluster then
26. assign Zp to min( d(Zp, M_i) ), where i in [1, k]
27. }
28. Check the termination condition of the algorithm; if satisfied
29. Exit.
30. Else
31. Calculate the centroid of each cluster using equation 11:

M_j = (1 / n_j) * sum_{Z_p in C_j} Z_p    (11)

where M_j is the centroid of cluster j and n_j is the number of data points in cluster j.
32. Go to step 13.

VI. MODIFIED APPROACH II

In the work of [2], the author calculates the distance between each pair of data points, selects the pair showing the minimum distance, and removes it from the actual dataset. Data points from the dataset are then added to this subset one by one, each time choosing the point with the minimum distance to it, until a threshold is reached. If the number of subsets formed is less than k, the distances among the remaining data points are computed again and the process is repeated until k subsets are formed.

In more detail, the first phase determines the initial centroids. Compute the distance between each data point and all other data points in the set D. Then find the closest pair of data points and form a set A1 consisting of these two data points, deleting them from the data point set D. Then determine the data point closest to the set A1, add it to A1 and delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. Then form another data-point set A2 in the same way, and repeat until k such sets of data points are obtained. Finally the initial centroids are obtained by averaging all the vectors in each data-point set. The Euclidean distance is used for determining the closeness of each data point to the cluster centroids.

The next phase is to assign points to the clusters.
Here the main idea is to maintain two simple data structures that retain, for each data object, the label of its cluster and its distance to the nearest cluster center during each iteration; these can be reused in the next iteration. We calculate the distance between the current data object and its new cluster center; if this distance is smaller than or equal to the distance to the old center, the data object stays in the cluster it was assigned to in the previous iteration, and there is no need to calculate its distance to the other k - 1 cluster centers, saving that computation time. Otherwise, we calculate the distance from the current data object to all k cluster centers, find the nearest cluster center, assign the point to it, and separately record the label of the nearest cluster center and the distance to it. Because in each iteration some data points remain in their original cluster, part of the distance calculation is skipped, saving total computation time and thereby enhancing the efficiency of the algorithm.
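One iteration of this cached assignment step can be sketched as follows; an illustrative Python sketch with my own names, where `labels` and `nearest_dist` are the two data structures described above.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cached_assign(points, centers, labels, nearest_dist):
    # One iteration of the cached assignment step: if a point is no farther
    # from its old center than before, it stays put and the other k - 1
    # distance computations are skipped; otherwise all centers are rescanned.
    k = len(centers)
    for i, p in enumerate(points):
        d_old = euclid(p, centers[labels[i]])
        if d_old <= nearest_dist[i]:
            nearest_dist[i] = d_old            # point stays in its cluster
            continue
        j = min(range(k), key=lambda c: euclid(p, centers[c]))
        labels[i], nearest_dist[i] = j, euclid(p, centers[j])
    return labels, nearest_dist

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centers = [(0.5, 0.0), (10.0, 0.0)]   # centers after an update step
labels = [0, 0, 1]                    # labels from the previous iteration
nearest = [0.0, 1.0, 0.0]             # cached distances to the old centers
labels, nearest = cached_assign(points, centers, labels, nearest)
```

In this example only the first point triggers a full rescan; the other two satisfy the cached-distance test and skip the remaining centers.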
Pseudo code for the modified k-Means algorithm is as follows:

Input: Dataset D of N data points (i = 1 to N); desired number of clusters = k
Output: N data points clustered into k clusters.

Phase 1:
Steps:
1. Set m = 1;
2. Compute the distance between each data point and all other data points in the set D using equation 12:

        d(x, y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 + ... + (xn − yn)^2 )        (12)

   where d(x, y) is the distance between the vectors x = (x1, x2, x3, ..., xn) and y = (y1, y2, y3, ..., yn).
3. Find the closest pair of data points in the set D and form a data-point set Am (1 <= m <= k) containing these two data points; delete these two data points from the set D;
4. Find the data point in D that is closest to the data-point set Am; add it to Am and delete it from D;
5. Repeat step 4 until the number of data points in Am reaches 0.75*(N/k);
6. If m < k, then set m = m + 1, find another pair of data points in D between which the distance is the shortest, form another data-point set Am and delete them from D; go to step 4;
7. For each data-point set Am (1 <= m <= k), find the arithmetic mean of the vectors of data points in Am; these means will be the initial centroids.

Phase 2:
Steps:
1. Compute the distance of each data point di (1 <= i <= N) to all the centroids Cj (1 <= j <= k) as d(di, Cj), using equation 12;
2. For each data point di, find the closest centroid Cj and assign di to cluster j;
3. Set ClusterId[i] = j; /* j: Id of the closest cluster for point i */
4. Set Nearest_Dist[i] = d(di, Cj);
5. For each cluster j (1 <= j <= k), recalculate the centroids;
6. Repeat
7. For each data point di:
   a. Compute its distance from the centroid of the present nearest cluster;
   b. If this distance is less than or equal to the present nearest distance, the data point stays in the cluster;
   c. Else for every centroid cj (1 <= j <= k) compute the distance d(di, Cj);
   d. End for;
8. Assign the data point di to the cluster with the nearest centroid Cj;
9. Set ClusterId[i] = j;
10. Set Nearest_Dist[i] = d(di, Cj);
11. End for (step 7);
12. For each cluster j (1 <= j <= k), recalculate the centroids;
until the convergence criteria is met, i.e. either no center updates or no point moves to another cluster.
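Phase 1 above (closest-pair seeding of k data-point sets, each grown to roughly 0.75*N/k points before averaging) might be sketched as follows. This is an illustrative reading of the pseudocode, not the authors' implementation:

```python
import numpy as np

def initial_centroids(X, k):
    """Seed each of k sets with the closest remaining pair of points, grow the
    set to ~0.75*N/k points by absorbing the nearest remaining point, then
    return each set's mean as an initial centroid (Phase 1 of [2])."""
    X = np.asarray(X, dtype=float)
    remaining = list(range(len(X)))
    threshold = max(2, int(0.75 * len(X) / k))
    centroids = []
    for _ in range(k):
        # The closest pair among the remaining points seeds the set.
        pts = X[remaining]
        d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        members = [remaining[i], remaining[j]]
        for idx in sorted((int(i), int(j)), reverse=True):
            remaining.pop(idx)
        # Grow the set with the nearest remaining point until the threshold.
        while len(members) < threshold and remaining:
            pts = X[remaining]
            dset = np.linalg.norm(pts[:, None] - X[members][None, :],
                                  axis=2).min(axis=1)
            members.append(remaining.pop(int(np.argmin(dset))))
        centroids.append(X[members].mean(axis=0))
    return np.array(centroids)
```

The Euclidean distance of equation 12 appears here as `np.linalg.norm`; the mean over each set matches step 7 of Phase 1.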
VII. PROPOSED APPROACH

The changes are based on the selection of the initial k centers and on making the algorithm work dynamically: instead of running the algorithm for different values of k, the algorithm itself decides how many clusters are present in a given dataset. The two modifications are as follows:
• A method to select the initial k centers.
• A dynamic determination of k.

Proposed Algorithm: The algorithm consists of three phases. The first phase determines the initial k centroids. The user inputs the dataset and the value of k, and the data space is divided into k*k segments as discussed in [1]. After dividing the data space, we choose the k segments with the highest frequency of points. If some segments have the same frequency, adjacent segments with the same least frequency are merged until k segments are obtained. Then, for each selected segment, the distance of each point in the segment from the origin is computed; these distances are sorted, and the middle point is selected as the center for that segment. This step is repeated for each of the k selected segments. These points represent the initial k centroids.

The second phase assigns points to clusters based on the minimum distance between each point and the cluster centers, using the Euclidean distance measure. It then computes the mean of each cluster as its next center. This process is repeated until there are no more center updates.

In the third phase, the algorithm iterates dynamically to determine the right number of clusters, using the concept of the silhouette validity index.

Pseudo code for the proposed algorithm:
Input: Dataset of N points; desired number of clusters k.
Output: N points grouped into k clusters.
Phase 1: Finding initial centroids
Steps:
1. Input the dataset and a value of k >= 2.
2. Divide the data space into k*k segments /* k vertically and k horizontally */
3. For each dimension {
4.   Calculate the minimum and maximum value of the data points.
5.   Calculate the segment width (Rg) using equation 13:

        Rg = (min + max) / k        (13)
   }
6. Calculate the frequency of data points in each segment.
7. Choose the k highest-frequency segments.
8. For each segment i = 1 to k {
9.   For each point j in segment i
   {
10.    Calculate the distance of point j from the origin
   }
11.  Sort these distances in ascending order in matrix D
12.  Select the middle distance.
13.  The co-ordinates corresponding to the distance selected in step 12 are chosen as the initial center for the ith cluster.
  }
14. These k co-ordinates are stored in matrix C, which represents the initial centroids.

Phase 2: Assigning points to clusters
Steps:
1. Repeat
2. For each data point p = 1 to N {
3.   For each cluster j = 1 to k {
4.     Calculate the distance between point p and the cluster centroid cj of Cj using equation 14:

          d(p, cj) = sqrt( Σi (pi − cji)^2 )        (14)

     }
   }
5. Assign p to the cluster with min{d(p, cj)}, where j ∈ [1, k].
6. Check the termination condition of the algorithm; if satisfied
7. Exit
8. Else
9. Calculate the new centroids of the clusters using equation 15:

        cj = (1/nj) * Σ p,  over all p ∈ Cj        (15)

   where nj is the number of points in cluster Cj.
10. Go to step 1.

Phase 3: Determining the appropriate number of clusters
For the given value of k, phases 1 and 2 are run for three iterations using k−2, k and k+2. Three corresponding silhouette values are calculated as discussed in section 2; these are denoted by Sk-2, Sk and Sk+2. The appropriate number of clusters is then found using the following steps.

Steps:
1. If Sk-2 < Sk and Sk > Sk+2, then run phases 1 and 2 using k+1 and k−1 and find the corresponding Sk+1 and Sk-1. The maximum of the three values Sk-1, Sk, Sk+1 then determines the appropriate number of clusters; for example, if Sk+1 is maximum, the number of clusters formed by the algorithm is k+1.
2. Else if Sk+2 > Sk and Sk+2 > Sk-2, then run phases 1 and 2 using k+1, k+3 and k+4 and find the corresponding Sk+1, Sk+3 and Sk+4. The k value corresponding to the maximum of Sk+1, Sk+2, Sk+3, Sk+4 is returned.
3. Else if Sk+2 < Sk-2 and Sk < Sk-2, then run phases 1 and 2 using k−1, k−2, k−3 and k−4 and find the corresponding Sk-1, Sk-2, Sk-3 and Sk-4. The k value corresponding to the maximum of Sk-1, Sk-2, Sk-3, Sk-4 is returned.
4. Stop.

Thus the algorithm terminates itself when the best value of k is found. This value of k gives the appropriate number of clusters for the given dataset.

VIII. RESULT ANALYSIS

The proposed algorithm is implemented and its results are compared with those of the modified approaches of [1] and [2] in terms of execution time and the initial centers chosen.

1. The total time taken by the proposed algorithm to form clusters and dynamically determine the appropriate number of clusters is less than the total time taken by the algorithm in [1] when the latter is run separately for different values of k (e.g. k = 2, 3, 4, 5, 6, 7), since the proposed algorithm itself iterates over the candidate values of k.
2. We define a new method to determine the initial centers, based on the middle value rather than the mean value. The reasoning is that the middle value best represents the distribution and, unlike the mean, is not influenced by extremely large or small values.

The results show that the algorithm works dynamically and is also an improvement over the original k-Means. Table I shows the results of running the algorithm in [1] over the Wine dataset from the UCI repository for k = 3, 4, 5, 6, 7 and 9.
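The three-branch search of phase 3 can be sketched as follows. Here `cluster_and_score(X, k)` is a hypothetical helper standing in for phases 1–2 followed by a silhouette computation; the branch structure mirrors the steps above:

```python
def best_k(X, k, cluster_and_score):
    """Dynamically select the number of clusters following phase 3.

    cluster_and_score(X, k) -> silhouette value (hypothetical helper that
    runs phases 1-2 for a given k and evaluates the resulting clustering).
    """
    s = {kk: cluster_and_score(X, kk) for kk in (k - 2, k, k + 2)}
    if s[k - 2] < s[k] > s[k + 2]:
        # Step 1: peak near k, so refine with k-1 and k+1.
        candidates = [k - 1, k, k + 1]
    elif s[k + 2] > s[k] and s[k + 2] > s[k - 2]:
        # Step 2: silhouette still rising, so search above k.
        candidates = [k + 1, k + 2, k + 3, k + 4]
    else:
        # Step 3: silhouette falling, so search below k.
        candidates = [k - 1, k - 2, k - 3, k - 4]
    for kk in candidates:
        if kk not in s and kk >= 2:   # k-Means needs at least 2 clusters
            s[kk] = cluster_and_score(X, kk)
    # Step 4: return the candidate k with the maximum silhouette value.
    return max((kk for kk in candidates if kk in s), key=lambda kk: s[kk])
```

The termination behaviour follows from the fact that each branch evaluates only a bounded set of candidates before taking the maximum.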
The algorithm in [1] is run for these values of k because the proposed algorithm was initially fed k = 7 and then iterated over these values of k automatically, so the total execution times of the two algorithms can be compared. The results show that the proposed algorithm takes less time than running the algorithm of [1] individually for the different values of k.

TABLE I: RESULTS OF ALGORITHM [1] FOR WINE DATASET

Sr. no.   Value of k   Silhouette validity index   Execution time (s)
1.        3            0.4437                      3.92
2.        4            0.3691                      4.98
3.        5            0.3223                      2.89
4.        6            0.2923                      7.51
5.        7            0.2712                      3.56
6.        9            0.2082                      11.96
                       TOTAL EXECUTION TIME        34.82

The proposed algorithm, iterating through its different runs, stops at:
maximum silhouette value = 0.443641 for best value of k = 3
Elapsed time is 30.551549 seconds.
Thus, from the results it is clear that the algorithm stops itself when it finds the right number of clusters, and that the proposed algorithm takes less time than the algorithm in [1]. Table II shows the execution times of both algorithms, together with the initial value of k fed to our algorithm and the best value of k at which it stops. Experiments are performed on random datasets of 50, 178 and 500 points.

TABLE II: COMPARING RESULTS OF PROPOSED ALGORITHM & ALGORITHM IN [1]

Sr. no.   Dataset      Initial value of k   Best value of k   Proposed algorithm time (s)   Algorithm [1] time (s)
1.        50 points    4                    6                 18.6084                       28.0096
2.        178 points   9                    10                39.1941                       50.2726
3.        500 points   5                    4                 66.6134                       91.9400

Table III shows comparison results between the original k-Means, modified approach II and the proposed algorithm. The proposed algorithm takes much less time than the original k-Means, and for a large dataset such as the 500-point dataset it also outperforms modified approach II.

TABLE III: EXECUTION TIME (s) COMPARISON

Sr. no.   Dataset      Original k-Means   Modified approach II   Proposed algorithm
1.        50 points    15.1727            11.4979                18.6084
2.        178 points   74.5168            21.6497                39.1941
3.        500 points   86.7619            87.2461                66.6134

From all these results we can conclude that, although the procedure of the proposed algorithm is longer, it spares the user from running the algorithm for different values of k as in the other three algorithms discussed in this paper. The proposed algorithm iterates dynamically and stops at the best value of k. Figures 1–3 show silhouette plots for the three random datasets discussed above, depicting how close each point is to the other members of its own cluster.
The plots also show that a point has been placed in an incorrect cluster when its silhouette index value is negative. Figure 4 shows the execution time comparison of all the algorithms discussed in the paper.
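The per-point silhouette values underlying these plots can be computed as in the following sketch. This is the standard silhouette formula s(i) = (b(i) − a(i)) / max(a(i), b(i)), stated here for illustration rather than taken from the paper's implementation:

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b), where a is the mean
    distance from point i to its own cluster and b the smallest mean distance
    to any other cluster. Negative s(i) suggests the point is misassigned."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():       # singleton cluster: define s(i) = 0
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in set(labels.tolist()) - {int(labels[i])})
        s[i] = (b - a) / max(a, b)
    return s
```

The overall silhouette validity index used throughout the paper is then the mean of these per-point values, e.g. `silhouette_values(X, labels).mean()`.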
Figure 1: Silhouette plot for 50 points
Figure 2: Silhouette plot for 178 points
Figure 3: Silhouette plot for 500 points
Figure 4: Execution time (s) comparison of different algorithms

Table IV compares all four algorithms discussed in the paper on the basis of the experimental results.

TABLE IV: COMPARISON OF ALGORITHMS

Initial centers:
- Original k-Means: always random, so the same value of k can yield different clusters for the same data.
- Modified approach I: fixed, by always choosing the initial centers in the highest-frequency segments.
- Modified approach II: fixed, by choosing points based on the similarity between points.
- Proposed algorithm: fixed, by choosing in each highest-frequency segment the point whose distance from the origin is the middle of that segment's sorted distances.

Redundant data:
- Original k-Means: can work with data redundancies.
- Modified approach I: suitable.
- Modified approach II: not suitable for data with redundant points.
- Proposed algorithm: can work with redundant data.

Dead unit problem:
- Original k-Means: yes. Modified approach I: no. Modified approach II: no. Proposed algorithm: no.

Value of k:
- Original k-Means, modified approach I and modified approach II: fixed input parameter.
- Proposed algorithm: an initial value is given as input; the algorithm iterates dynamically and determines the best value of k for the given data.

Execution time:
- Original k-Means: the most of the four.
- Modified approach I: less than the original k-Means, but more than the other two algorithms.
- Modified approach II: less than the original k-Means and modified approach I, but more than the proposed algorithm.
- Proposed algorithm: less than all three other algorithms.
IX. CONCLUSION

In this paper we presented different approaches to k-Means clustering, compared them, and stressed their pros and cons. Another issue discussed in the paper is clustering validity, which we measured using the silhouette validity index; for a given dataset, this index shows which value of k produces compact and well-separated clusters. The paper presents a new method for selecting initial centers and a dynamic approach to k-Means clustering, so that the user need not check the clusters for different values of k: the user inputs an initial value of k, and the algorithm stops after it finds the best value of k, i.e. when it attains the maximum silhouette value. Experiments also show that the proposed dynamic algorithm takes much less computation time than the other three algorithms discussed in the paper.

X. ACKNOWLEDGEMENTS

I am eternally grateful to my research supervisor Dr. M.P.S Bhatia for his invigorating support and valuable suggestions and guidance. I thank him for his supervision and for patiently correcting me in my work. It has been a great experience and I have gained a lot here. I am thankful to the almighty God who has given me the strength, good sense and confidence to complete my research analysis successfully. I also thank my parents and my friends, who were a constant source of encouragement. I would also like to thank Navneet Singh for his appreciation.

REFERENCES

Proceedings Papers
[1] Ran Vijay Singh and M.P.S Bhatia, "Data Clustering with Modified K-means Algorithm", IEEE International Conference on Recent Trends in Information Technology (ICRTIT 2011), pp. 717-721.
[2] D. Napoleon and P. Ganga Lakshmi, "An Efficient K-Means Clustering Algorithm for Reducing Time Complexity using Uniform Distribution Data Points", IEEE, 2010.
Journal Papers
[3] Tajunisha and Saravanan, "Performance Analysis of k-means with Different Initialization Methods for High Dimensional Data", International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 1, No. 4, October 2010.
[4] Neha Aggarwal and Kriti Aggarwal, "A Mid-point based k-mean Clustering Algorithm for Data Mining", International Journal on Computer Science and Engineering (IJCSE), 2012.
[5] Barileé Barisi Baridam, "More Work on k-means Clustering Algorithm: The Dimensionality Problem", International Journal of Computer Applications (0975-8887), Vol. 44, No. 2, April 2012.

Proceedings Papers
[6] Shi Na, Li Xumin and Guan Yong, "Research on K-means Clustering Algorithm", Proc. Third International Symposium on Intelligent Information Technology and Security Informatics, IEEE, 2010.
[7] Ahamad Shafeeq and Hareesha, "Dynamic Clustering of Data with Modified K-mean Algorithm", Proc. International Conference on Information and Computer Networks (ICICN 2012), IPCSIT Vol. 27, IACSIT Press, Singapore, 2012.

Research Reports
[8] Kohei Arai and Ali Ridho Barakbah, "Hierarchical K-means: An Algorithm for Centroids Initialization for k-Means", Reports of the Faculty of Science and Engineering, Saga University, Vol. 26, No. 1, 2007.

Books
[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (Second Edition).