Journal of Classification 5:237-247 (1988)
Recent Convergence Results for the Fuzzy c-Means
Clustering Algorithms

Richard J. Hathaway
Georgia Southern College

James C. Bezdek
Boeing Electronics
Abstract: One of the main techniques embodied in many pattern recognition systems is cluster analysis, the identification of substructure in unlabeled data sets. The fuzzy c-means algorithms (FCM) have often been used to solve certain types of clustering problems. During the last two years several new local results concerning both numerical and stochastic convergence of FCM have been found. Numerical results describe how the algorithms behave when evaluated as optimization algorithms for finding minima of the corresponding family of fuzzy c-means functionals. Stochastic properties refer to the accuracy of minima of FCM functionals as approximations to parameters of statistical populations which are sometimes assumed to be associated with the data. The purpose of this paper is to collect the main global and local, numerical and stochastic, convergence results for FCM in a brief and unified way.

Keywords: Cluster analysis; Convergence; Fuzzy c-means algorithm; Optimization; Partitioning algorithms; Pattern recognition.
1. Introduction
The purpose of this paper is to gather and discuss some recent convergence results regarding the Fuzzy c-Means (FCM) clustering algorithms.
Authors' Addresses: Richard J. Hathaway, Mathematics and Computer Science Department, Georgia Southern College, Statesboro, Georgia 30460, USA and James C. Bezdek, Information Processing Lab., Boeing Electronics, PO Box 24969, Seattle, Washington 98124-6269, USA.
These algorithms are quite useful in solving certain kinds of clustering problems where a population X of n objects, each represented by some vector of s numerical features or measurements x ∈ R^s, is to be decomposed into subpopulations (or clusters) of similar objects. The FCM algorithms use the set of feature vectors, along with some initial guess about the cluster substructure, to obtain a partitioning of the objects into fuzzy clusters, and as a by-product of the partitioning procedure, produce a prototypical feature vector representing each subpopulation. FCM is known to produce reasonable partitionings of the original data in many cases. (See Bezdek 1981, for many illustrative examples.) Furthermore, the algorithm is known to produce partitionings very quickly compared to some other approaches. For example, Bezdek, Hathaway, Davenport and Glynn (1985) have shown that FCM is, on average, perhaps an order of magnitude faster than the maximum likelihood approach to estimation of the parameters of a mixture of two univariate normal distributions. These facts justify the study of convergence properties of the fuzzy c-means algorithms, which have become much better understood during the last two years.
In this note we wish to survey convergence theory for the FCM algorithms on two different levels. First of all, the algorithms are iterative optimization schemes tailored to find minima of a corresponding family of fuzzy c-means functionals. Our first look at convergence results will concern numerical convergence properties of the algorithms (or equivalently, of the sequence of iterates produced by the algorithms): how well do they attain the minima they were designed to find? This type of theory is referred to herein as numerical convergence theory, and is concerned with questions like how fast the iterates converge to a minimum of the appropriate functional, or whether they converge at all. These properties are discussed in Section 3, where both theoretical and empirical studies are cited.
The second type of convergence theory examined is referred to herein as stochastic convergence theory. It concerns a completely different kind of question: how accurate are the minima (that the fuzzy c-means algorithms try to find) in representing the actual cluster substructure of a sample? The pertinent theoretical result cited in Section 4 regards the statistical concept of consistency. Some additional light can be shed on stochastic convergence properties by considering the results of empirical tests, also contained in Section 4.
It is clear that both types of convergence results are useful in interpret-
ing final partitionings produced by the FCM algorithms. The algorithms are
introduced in the next section; the final section contains a discussion and
topics for further research.
2. The FCM Algorithms
Let c ≥ 2 be an integer; let X = {x_1, ..., x_n} ⊂ R^s be a finite data set containing at least c < n distinct points; and let R^{cn} denote the set of all real c × n matrices. A nondegenerate fuzzy c-partition of X is conveniently represented by a matrix U = [u_{ik}] ∈ R^{cn}, the entries of which satisfy

$$u_{ik} \in [0,1], \quad 1 \le i \le c, \; 1 \le k \le n; \tag{1a}$$

$$\sum_{i=1}^{c} u_{ik} = 1, \quad 1 \le k \le n; \tag{1b}$$

$$\sum_{k=1}^{n} u_{ik} > 0, \quad 1 \le i \le c. \tag{1c}$$

The set of all matrices in R^{cn} satisfying (1) is denoted by M_{fcn}. A matrix U ∈ M_{fcn} can be used to describe the cluster structure of X by interpreting u_{ik} as the grade of membership of x_k in cluster i: u_{ik} = .95 represents a strong association of x_k to cluster i, while u_{ik} = .01 represents a very weak one. Note that M_{cn}, the subset of M_{fcn} which contains only matrices with all u_{ik}'s in {0,1}, is exactly the set of non-degenerate crisp (or conventional) c-partitions of X. Other useful information about cluster substructure can be conveyed by identifying prototypes (or cluster centers) v = (v_1, ..., v_c)^T ∈ R^{cs}, where v_i ∈ R^s is the prototype for class i, 1 ≤ i ≤ c. "Good" partitions U of X and representatives (v_i for class i) may be defined by considering minimization of one of the family of c-means objective functionals J_m : (M_{fcn} × R^{cs}) → R defined by

$$J_m(U, v) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m \, \| x_k - v_i \|^2, \tag{2}$$

where 1 < m < ∞ and ||·|| is any inner product induced norm on R^s. This approach was first given for m = 2 in Dunn (1973) and then generalized to the above range of values of m in Bezdek (1973). For m > 1, Bezdek (1973) gave the following necessary conditions for a minimum (U*, v*) of J_m(U, v) over M_{fcn} × R^{cs}:

$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m} \quad \text{for all } i; \tag{3a}$$
and for each k such that d_{ik} = ||x_k - v_i||^2 > 0 for all i,

$$u_{ik} = \left[ \sum_{j=1}^{c} a_{ikj} \right]^{-1} \quad \text{for all } i, \tag{3b1}$$

where

$$a_{ikj} = \left[ d_{ik} / d_{jk} \right]^{1/(m-1)};$$

but if k is such that d_{ik} = ||x_k - v_i||^2 = 0 for some i, then the u_{ik}, for all i, are any nonnegative numbers satisfying

$$\sum_{i=1}^{c} u_{ik} = 1 \quad \text{and} \quad u_{ik} = 0 \text{ if } d_{ik} \ne 0. \tag{3b2}$$
The FCM algorithms consist of iterations alternating between equations (3a) and (3b). The process is started either with an initial guess for the partitioning U or an initial guess for the prototype vectors v, and is continued until successive iterates of the partitioning matrix barely differ; that is, iteration stops with the first U^{r+1} such that ||U^{r+1} - U^r|| < ε, where ε is a small positive number. The numerical convergence results which follow concern the behavior of the sequences {U^r} and {v^r}, while the stochastic theory refers to how well minima of (2) actually represent the cluster substructure of a population under certain statistical assumptions.
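The alternating updates just described can be sketched in a few lines of plain Python. This is only an illustrative rendering of (3a)-(3b) with the ||U^{r+1} - U^r|| stopping rule, not the authors' distributed code; in particular, the uniform split over zero-distance prototypes in the singular case is one of several choices permitted by (3b2), and the random initialization of U is our own.

```python
import random

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Sketch of the FCM iteration: alternate the prototype update (3a)
    and the membership update (3b) until successive partition matrices
    barely differ (max-entry difference below eps)."""
    rng = random.Random(seed)
    n, s = len(X), len(X[0])
    # Random initial fuzzy partition U: columns sum to 1, as in (1b).
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        col = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= col
    V = [[0.0] * s for _ in range(c)]
    for _ in range(max_iter):
        # (3a): each prototype is a membership-weighted mean of the data.
        for i in range(c):
            w = [U[i][k] ** m for k in range(n)]
            tot = sum(w)
            V[i] = [sum(w[k] * X[k][t] for k in range(n)) / tot for t in range(s)]
        # (3b): memberships from squared distances d_ik = ||x_k - v_i||^2.
        newU = [[0.0] * n for _ in range(c)]
        for k in range(n):
            dist = [sum((X[k][t] - V[i][t]) ** 2 for t in range(s))
                    for i in range(c)]
            if min(dist) == 0.0:
                # (3b2) singular case: split membership evenly over the
                # zero-distance prototypes (one admissible choice).
                zeros = [i for i in range(c) if dist[i] == 0.0]
                for i in zeros:
                    newU[i][k] = 1.0 / len(zeros)
            else:
                # (3b1): u_ik = [sum_j (d_ik / d_jk)^(1/(m-1))]^(-1).
                for i in range(c):
                    newU[i][k] = 1.0 / sum(
                        (dist[i] / dist[j]) ** (1.0 / (m - 1.0))
                        for j in range(c))
        diff = max(abs(newU[i][k] - U[i][k]) for i in range(c) for k in range(n))
        U = newU
        if diff < eps:
            break
    return U, V
```

On two well-separated one-dimensional blobs, the returned prototypes land near the blob means and each column of U still sums to 1, as (1b) requires.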
3. Numerical Convergence Properties
3.1 Theory
The properties of sequences of iterates produced by optimization algorithms can be divided into two different kinds of results. Global results refer to properties which hold for every iteration sequence produced by the algorithm, regardless of what the initial iterate was; whereas local convergence properties refer to the behavior of sequences of iterates produced by the algorithm when the initial iterate is "sufficiently close" to an actual solution (in this case a local minimum of J_m in (2)). As a simple example, local convergence results for Newton's method show it to be quadratically convergent in most cases, after it has gotten sufficiently close to the solution, while the global convergence theory for Newton's method without modification is weak; the algorithm readily fails to converge (at any rate) when started from a sufficiently poor initial guess.
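The local-versus-global contrast drawn here for Newton's method can be seen on the toy equation arctan(x) = 0, a standard illustrative example (not taken from the paper):

```python
import math

def newton_arctan(x0, iters=10):
    """Newton's method on f(x) = atan(x), whose unique root is 0:
    x_{r+1} = x_r - f(x_r)/f'(x_r) = x_r - atan(x_r) * (1 + x_r**2)."""
    x = x0
    for _ in range(iters):
        x = x - math.atan(x) * (1.0 + x * x)
    return x

near = newton_arctan(1.0)          # close start: rapid convergence to 0
far = newton_arctan(2.0, iters=6)  # poor start: the iterates blow up
```

Started at x0 = 1 the iterates collapse onto the root within a handful of steps; started at x0 = 2 the same iteration diverges, which is exactly the "weak global theory" being described.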
Early FCM convergence results were of the global type. In Bezdek (1980), it was claimed that iterates produced by FCM always converged, at least along a subsequence, to a minimum of (2). The proof utilized the convergence theory in Zangwill (1969), but incorrectly identified the set of possible limit points as consisting only of minima. This original theorem was clearly identified as being incorrect by particular counterexamples found by Tucker (1987). The corrected global result given below is taken from Hathaway, Bezdek and Tucker (1987):
A. Global Convergence Theorem for FCM. Let the sample X contain at least c < n distinct points, and let (U^0, v^0) be any starting point in M_{fcn} × R^{cs} for the FCM iteration sequence {(U^r, v^r)}, r = 1, 2, .... If (U*, v*) is any limit point of the sequence, then it is either a minimum or saddle point of (2).

Note: it is worth re-emphasizing that Theorem A is called a global convergence theorem because convergence to a minimum or saddle point occurs from any initialization; when convergence is to a minimum, it may be either a local or (the) global minimum of J_m.
Local results for FCM are very recent. The following result, taken
from Hathaway and Bezdek (1986a), was the first local convergence property
derived for FCM.
B. Local Convergence Theorem for FCM. Let the sample X contain at least c < n distinct points, and (U*, v*) be any minimum of (2) such that d_{ik} > 0 for all i, k, and at which the Hessian of J_m is positive definite relative to all feasible directions. Then FCM is locally convergent to (U*, v*).
This theorem guarantees that when FCM iteration is started close enough to a minimum of J_m, the ensuing sequence converges to that particular minimum. The last theorem, from Bezdek, Hathaway, Howard and Wilson (1987), regards the rate of local convergence.
C. Local Convergence Rate Theorem for FCM. Let the sample X contain at least c < n distinct points, and (U*, v*) be any minimum of (2) such that d_{ik} > 0 for all i, k, and at which the Hessian of J_m is positive definite relative to all feasible directions. If {(U^r, v^r)} is an FCM sequence converging to (U*, v*), then the sequence converges linearly to the solution; that is, there is a number 0 < λ < 1 and a norm ||·|| such that for all sufficiently large r,

$$\| (U^{r+1}, v^{r+1}) - (U^*, v^*) \| \le \lambda \, \| (U^r, v^r) - (U^*, v^*) \|.$$
Note: The number λ in Theorem C is equal to the spectral radius of a matrix obtained from the Hessian of J_m which is exhibited in Bezdek, Hathaway, Howard and Wilson (1987).
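Linear convergence in the sense of Theorem C means that, eventually, each step shrinks the error by a roughly constant factor λ. The effect is easy to observe on any linearly convergent fixed-point iteration; the sketch below uses x_{r+1} = cos(x_r) as a generic stand-in (not FCM itself), whose rate at the fixed point x* ≈ 0.739 is λ = |sin(x*)| ≈ 0.674:

```python
import math

XSTAR = 0.7390851332151607  # fixed point of x = cos(x)

def error_ratios(x0, iters=40):
    """Track |e_{r+1}| / |e_r| for the iteration x_{r+1} = cos(x_r);
    the ratios settle near the linear rate lambda = |sin(x*)|."""
    x, ratios = x0, []
    for _ in range(iters):
        nxt = math.cos(x)
        ratios.append(abs(nxt - XSTAR) / abs(x - XSTAR))
        x = nxt
    return ratios

ratios = error_ratios(1.0)  # successive ratios approach ~0.674
```

Measuring these empirical ratios on FCM iterates would, by Theorem C, reveal the spectral radius λ of the matrix mentioned in the Note above.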
To summarize the numerical convergence theory for FCM, the algorithm is globally convergent, at least along subsequences, to minima, or at worst saddle points, of the FCM functionals in (2). Additionally, the algorithm is locally, linearly convergent to (local) minima of J_m. The following result concerning tests for optimality is taken from Kim, Bezdek and Hathaway (1987). In the statement of the result, H(U) is the cn × cn Hessian matrix of the function f(U) = min {J_m(U, v) | v ∈ R^{cs}} and P = I - (1/c)K, where I is the cn × cn identity matrix and K is the cn × cn block-diagonal matrix with n (c × c) diagonal blocks of all 1's.

D. Optimality Tests Theorem for FCM. At termination of the FCM algorithm, if (U*, v*) is a local minimum of the objective function J_m(U, v), then PH(U*)P is positive semidefinite.

Note that Theorem D gives a necessary but not sufficient condition for a local minimum. The importance of Theorem D is due to the fact that efficient algorithms exist for checking whether PH(U*)P is positive semidefinite. (See Kim, Bezdek and Hathaway 1987, for implementing an optimality test based on Theorem D.) Other recent work concerning numerical convergence and the testing of points for optimality is given by Ismail and Selim (1986), and Selim and Ismail (1986).
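To make the test concrete, one needs the projector P and a positive-semidefiniteness check. The sketch below builds P = I - (1/c)K directly from its definition and tests semidefiniteness by attempting a Cholesky factorization of the matrix plus a tiny multiple of I; this is one standard approach, not the specific procedure of Kim, Bezdek and Hathaway (1987), and the matrices standing in for the Hessian H(U*) are hypothetical examples chosen only to exercise the test.

```python
def projector(c, n):
    """P = I - (1/c)K, where K is block-diagonal with n all-ones c x c
    blocks; variables are ordered u_{1k}, ..., u_{ck} block by block."""
    cn = c * n
    P = [[1.0 if i == j else 0.0 for j in range(cn)] for i in range(cn)]
    for b in range(n):
        for i in range(c):
            for j in range(c):
                P[b * c + i][b * c + j] -= 1.0 / c
    return P

def mat_mul(A, B):
    """Plain dense matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def is_psd(A, shift=1e-9):
    """A symmetric A is positive semidefinite iff A + shift*I admits a
    Cholesky factorization for a tiny shift > 0 (up to roundoff)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = s + shift
                if d <= 0.0:
                    return False
                L[i][i] = d ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True

P = projector(2, 2)
# A PSD stand-in for H(U*) passes the necessary condition of Theorem D ...
H_ok = [[float(i == j) * (i + 1) for j in range(4)] for i in range(4)]  # diag(1,2,3,4)
# ... while an indefinite stand-in fails it.
H_bad = [[-float(i == j) for j in range(4)] for i in range(4)]          # -I
```

Since Theorem D is only a necessary condition, passing the check does not certify a minimum; failing it rules one out.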
3.2 Empirical Observations
While theoretical results are important in understanding the FCM clus-
tering algorithms, they do not by themselves indicate exactly how effective
the algorithms are in finding minima of (2). For example, knowing only that
an algorithm is linearly convergent locally does not guarantee convergence
will occur quickly enough to be useful; many linearly convergent algorithms
are of little practical utility exactly because they converge too slowly. Much
numerical testing has been done in order to leam more about the effectiveness
of FCM in flnding minima of ,I- (Hathaway, Huggins and Bezdek 1984;Bez-
dek, Hathaway and Huggins 1985; Bezdek, Davenport, Hathaway and Glynn
1985; Davenport, Bezdek and Hathaway 1987). Some of the general results
found in those empirical tests are discussed below.
The FCM algorithms converge very quickly to optimal points for (2). The simulations done in the four papers mentioned above all involved generation of data from a known mixture of normal distributions, so that each
subpopulation was in fact normally distributed. The FCM approach (with m = 2) was used to decompose the unlabeled total population into its normally distributed component subpopulations. This approach was compared with parametric estimation using Wolfe's EM algorithm based on the maximum likelihood principle to find extrema of the likelihood function corresponding to a mixture of normal distributions (Wolfe 1970). The FCM approach almost always converged within 10 to 20 iterations, while the widely used EM algorithm took hundreds, and in some cases over a thousand, iterations to converge. This empirical result is even more significant when we note that each iteration of FCM is relatively inexpensive computationally compared to approaches such as EM.
In addition to being fast, these numerical simulations indicate that the FCM approach is relatively independent of initialization. It is not the case, however, that termination, which is relatively independent of the initial guess, usually occurs at the global minimum of J_m. Rather, in this instance a local minimum often dominates convergence (presumably because it identifies truly distinctive substructure). Although no comprehensive study has been done regarding whether terminal points of FCM are usually minima or saddle points of (2), in our experience convergence to a saddle point for other than contrived data happens very rarely, if ever, in practice. No Monte Carlo type simulation studies have been conducted to date concerning the percentages of runs that terminate at each type of extremum (local minimum, saddle point, global minimum). Indeed, it is not clear how one determines the global minimum needed to conduct such studies except for trivially small artificial data sets. Further, Bezdek (1973) exhibits an example in which the global minimum is less attractive (visually) than local minima of J_m for m > 1. Nonetheless, this would constitute an interesting and useful numerical experiment for a future study. Next, the question of (statistical) accuracy under the mixture assumption is discussed.
4. Stochastic Convergence Properties
4.1 Theory
In order to construct a statistical theory concerning the accuracy of partitionings and cluster prototypes produced by FCM, it is necessary to impose a statistical framework by which estimator accuracy can be measured. Otherwise, there are only specific examples from which general conclusions about partitioning quality cannot be easily drawn. There are certainly 2-dimensional examples where FCM has done a good job (visually) of representing the cluster substructure of a population, and other cases where FCM has done a poor job. Indeed, this is the case for all clustering
algorithms. The sole theoretical result in this context, taken from Hathaway and Bezdek (1986b), is somewhat negative.

E. Consistency Theorem. Let p(y; α_1, α_2) = α_1 p_1(y) + α_2 p_2(y), where p_1 and p_2 are symmetric densities with respective centers (means) of 0 and 1, and suppose that the expected values of |Y|^2 taken with respect to the component distributions are finite. Then there exist subpopulation proportions α_1 and α_2 such that the FCM cluster centers v_1 and v_2 for m = 2 are not consistent for the true component distribution centers 0 and 1 of p(y; α_1, α_2).
The statistical concept of consistency refers to the accuracy of the procedure as it is given increasing information (in this case through more and more members of the population of objects to be clustered). The theorem states that even when it is possible to observe an infinite number of members of the population to be clustered, the FCM approach has limited accuracy in being able to determine the true centers (means) of the component populations. This result is not particularly surprising, however, because the FCM functionals in (2) are not based on statistical assumptions; that is, minimizing J_m does not optimize any principle of statistical inference, such as maximizing a likelihood functional. Note that Theorem E implies a limitation on the accuracy obtainable for any type of symmetric component distributions, and this limitation in accuracy would actually be observable given sequences of larger and larger samples and sufficiently accurate calculation of the FCM prototypes. It is reasonable to conjecture that the asymptotic accuracy depends on such things as the amount of component separation and types of components; but no theoretical work has been done on this. (Empirical findings regarding the accuracy are given in the next section.)
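The flavor of Theorem E can be reproduced in a small simulation. The sketch below draws a large sample from a two-component normal mixture with unequal proportions and heavy overlap and runs a scalar FCM (m = 2, c = 2); the particular proportions, variances, sample size, and starting prototypes are our own illustrative choices, and the fitted prototypes land visibly away from the true centers 0 and 1:

```python
import random

def fcm_centers_1d(xs, v, iters=120):
    """Scalar FCM with m = 2, c = 2: the memberships of (3b1) reduce to
    u_ik = (1/d_ik) / (1/d_1k + 1/d_2k); prototypes follow (3a)."""
    for _ in range(iters):
        num, den = [0.0, 0.0], [0.0, 0.0]
        for x in xs:
            d = [(x - v[0]) ** 2, (x - v[1]) ** 2]
            if min(d) == 0.0:            # point sits exactly on a prototype
                u = [1.0 if di == 0.0 else 0.0 for di in d]
            else:
                inv = [1.0 / d[0], 1.0 / d[1]]
                tot = inv[0] + inv[1]
                u = [inv[0] / tot, inv[1] / tot]
            for i in (0, 1):
                num[i] += (u[i] ** 2) * x
                den[i] += u[i] ** 2
        v = [num[0] / den[0], num[1] / den[1]]
    return v

rng = random.Random(7)
# Overlapping mixture: 85% of the mass centered at 0, 15% at 1 (sigma 0.5).
xs = [rng.gauss(0.0, 0.5) if rng.random() < 0.85 else rng.gauss(1.0, 0.5)
      for _ in range(4000)]
v1, v2 = sorted(fcm_centers_1d(xs, [-0.5, 1.5]))
bias = abs(v1 - 0.0) + abs(v2 - 1.0)  # persistent even as the sample grows
```

A single run of course proves nothing; it merely illustrates the kind of persistent discrepancy that Theorem E guarantees for some choice of proportions.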
This section is ended by noting that result E resolves a longstanding
question about whether FCM does in fact provide consistent estimators for
normal mixtures (cf. Bezdek 1981). As with numerical convergence, the
above theory only provides partial understanding about FCM partitioning. As
in Section 3, we supplement this by the following results of numerical experi-
ments.
4.2 Empirical Observations
To assess empirically the accuracy of FCM in one particular case (a mixture of c = 2 univariate, normally distributed subpopulations), Monte Carlo simulations, which are discussed in the four references given in Section 3.2, were conducted to compare the FCM approach to Wolfe's (1970) method of maximum likelihood specific to the true family of distributions used to generate the feature vector data.
As in the case of numerical convergence properties, the observed behavior of the approach was as good or better than that indicated by the theory. Not only did FCM produce cluster substructure estimates faster than the maximum-likelihood method, but in most cases the estimates were at least as accurate. Only when the component centers got very close did the maximum likelihood approach become clearly superior to that based on FCM. Roughly speaking, if there is enough separation of component distributions to create multimodality of the corresponding mixture density, then FCM has a "reasonable" chance of producing estimates which are at least as accurate as those obtained by maximum likelihood. It must be kept in mind that FCM is nonparametric in that it does not assume any particular form for the underlying distributions, while the maximum likelihood method relies heavily on the (correct) assumption that each component population is normally distributed. The motivation for this study is simple: FCM is less computationally demanding than maximum likelihood in terms of both time and space complexity. It should be noted that several comparisons of FCM to Hard c-Means (HCM) or Basic ISODATA are discussed in Bezdek (1981): FCM substantially extends the utility of HCM through the expedient of allowing overlapping clusters.
5. Discussion
The FCM approach has proven to be very effective for solving many
cluster analysis problems; and the behavior of the FCM approach in practice
is well documented. There are, of course, many substantial unanswered questions about FCM. For example, how should the parameter m in (2) be chosen? This parameter in some sense controls the extent to which U is fuzzy. As m approaches one from above, optimal U's for J_m approach M_{cn}; conversely, as m → ∞, every u_{ik} given by (3b) approaches (1/c). Moreover, interpretation of the numbers {u_{ik}} is itself controversial; to what extent do these numbers really assess a "degree of belongingness" to different subpopulations? Aspects of FCM such as these are further discussed in Bezdek (1981). It is clear that much more can be learned about the stochastic convergence theory of FCM, but it is probably true that numerical aspects of these algorithms are currently well understood.
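The effect of m on the fuzziness of U can be read directly off the membership formula (3b1). For a point with fixed squared prototype distances, m slightly above 1 yields a nearly crisp membership vector, while a large m flattens every membership toward 1/c; the distances below are arbitrary illustrative values:

```python
def memberships(d, m):
    """Memberships of one point in each of c clusters via (3b1), given
    positive squared distances d = (d_1, ..., d_c) to the prototypes."""
    c = len(d)
    return [1.0 / sum((d[i] / d[j]) ** (1.0 / (m - 1.0)) for j in range(c))
            for i in range(c)]

d = (1.0, 4.0, 9.0)
crisp = memberships(d, 1.01)    # m near 1: essentially a hard assignment
flat = memberships(d, 100.0)    # m large: every membership near 1/c
```

Both membership vectors still sum to 1, as constraint (1b) requires; only their spread changes with m.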
Interesting questions remain concerning the saddle points of (2). First, how often will FCM converge to saddle points rather than minima? Is this a problem in practice, or just a necessary theoretical consideration? Are there ever cases when saddle points of (2) do a good job of representing the structure of the population? Another line of research involves extension of the results collected above to the more general fuzzy c-varieties functionals discussed in Bezdek (1981): which, if any, of the results above carry over to the more general setting? We hope to make these questions subjects of future reports. Readers interested in obtaining listings and/or computer programs for FCM in BASIC, PASCAL, FORTRAN or C may contact either author at their listed addresses.
References
BEZDEK, J. (1973), "Fuzzy Mathematics in Pattern Classification," Ph.D. dissertation, Cor-
nell University, Ithaca, New York.
BEZDEK, J. (1980), "A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms," Institute of Electrical and Electronic Engineers Transactions on Pattern Analysis and Machine Intelligence, 2, 1-8.
BEZDEK, J. (1981), Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press.
BEZDEK, J., DAVENPORT, J., HATHAWAY, R., and GLYNN, T. (1985), "A Comparison of the Fuzzy c-Means and EM Algorithms on Mixture Distributions with Different Levels of Component Overlapping," in The Proceedings of the 1985 IEEE Workshop on Languages for Automation: Cognitive Aspects in Information Processing, ed. S. K. Chang, Silver Spring, Maryland: Institute of Electrical and Electronic Engineers Computer Society Press, 98-102.
BEZDEK, J., HATHAWAY, R., HOWARD, R., WILSON, C., and WINDHAM, M. (1987), "Local Convergence Analysis of a Grouped Variable Version of Coordinate Descent," Journal of Optimization Theory and Applications, 54, 471-477.
BEZDEK, J., HATHAWAY, R., and HUGGINS, V. (1985), "Parametric Estimation for Normal Mixtures," Pattern Recognition Letters, 3, 79-84.
BEZDEK, J., HATHAWAY, R., SABIN, M., and TUCKER, W. (1987), "Convergence Theory for Fuzzy c-Means: Counterexamples and Repairs," Institute of Electrical and Electronic Engineers Transactions on Systems, Man and Cybernetics, 17, 873-877.
DAVENPORT, J., BEZDEK, J., and HATHAWAY, R. (1988), "Parameter Estimation for a Mixture of Distributions Using Fuzzy c-Means and Constrained Wolfe Algorithms," Journal of Computers and Mathematics with Applications, 15, 819-828.
DUNN, J. (1973), "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact, Well-Separated Clusters," Journal of Cybernetics, 3, 32-57.
HATHAWAY, R., and BEZDEK, J. (1986a), "Local Convergence of the Fuzzy c-Means Algorithms," Pattern Recognition, 19, 477-480.
HATHAWAY, R., and BEZDEK, J. (1986b), "On the Asymptotic Properties of Fuzzy c-Means Cluster Prototypes as Estimators of Mixture Subpopulation Centers," Communications in Statistics: Theory and Methods, 15, 505-513.
HATHAWAY, R., BEZDEK, J., and TUCKER, W. (1987), "An Improved Convergence Theory for the Fuzzy ISODATA Clustering Algorithms," in Analysis of Fuzzy Information, ed. J. C. Bezdek, Volume 3, Boca Raton: CRC Press, 123-132.
HATHAWAY, R., HUGGINS, V., and BEZDEK, J. (1984), "A Comparison of Methods for Computing Parameter Estimates for a Mixture of Normal Distributions," in Proceedings of the Fifteenth Annual Pittsburgh Conference on Modeling and Simulation, ed. E. Casetti, Research Triangle Park, NC: ISA, 1853-1860.
ISMAIL, M., and SELIM, S. (1986), "Fuzzy c-Means: Optimality of Solutions and Effective Termination of the Algorithm," Pattern Recognition, 19, 481-485.
KIM, T., BEZDEK, J., and HATHAWAY, R. (1987), "Optimality Test for Fixed Points of the FCM Algorithms," Pattern Recognition (in press).
SELIM, S., and ISMAIL, M. (1986), "On the Local Optimality of the Fuzzy ISODATA Clustering Algorithm," Institute of Electrical and Electronic Engineers Transactions on Pattern Analysis and Machine Intelligence, 8, 284-288.
TUCKER, W. (1987), "Counterexamples to the Convergence Theorem for Fuzzy ISODATA Clustering Algorithms," in Analysis of Fuzzy Information, ed. J. Bezdek, Volume 3, Boca Raton: CRC Press, 107-122.
WOLFE, J. H. (1970), "Pattern Clustering by Multivariate Mixture Analysis," Multivariate Behavioral Research, 5, 329-350.
ZANGWILL, W. (1969), Nonlinear Programming: A Unified Approach, Englewood Cliffs, NJ: Prentice Hall.