Divergence-based clustering and applications of total Jensen divergences
1. Divergence-based center clustering and its applications
Frank Nielsen
École Polytechnique
Sony Computer Science Laboratories, Inc.
ICMS, International Centre for Mathematical Sciences
Edinburgh, Sep. 21-25, 2015
Computational information geometry for image and signal processing
© 2015 Frank Nielsen
2. Center-based clustering [12]: Setting up the context
Countless applications of clustering: quantization (coding), finding
categories (unsupervised-clustering), technique for speeding-up
computations (e.g., distances), and so on.
Minimize the objective/energy/loss function over the centers C:

E(X = {x_1, ..., x_n}; C = {c_1, ..., c_k}) = Σ_{i=1}^n min_{j∈[k]} D(x_i : c_j)
Initialize k cluster centers (seeds): random (Forgy), global k-means (discrete k-means), randomized k-means++ (expected guarantee Õ(log k)).
Famous heuristics: Lloyd's batched allocation (assignment/center relocation), Hartigan's single-point reassignment. Both guarantee monotone convergence.
Variational k-means: when the centroid argmin_c Σ_{i=1}^n D(x_i : c) is not available in closed form, the relocated center just needs to be better (not best) to still guarantee monotone convergence.
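As a concrete sketch of the batched Lloyd loop above (the function names, the squared Euclidean divergence, and the mean-centroid relocation rule are illustrative choices, not from the slides; any "better, not best" relocation would preserve the monotone decrease):

```python
import numpy as np

def lloyd_kmeans(X, seeds, D, centroid, iters=100):
    """Batched Lloyd heuristic for a generic divergence D(x, c).

    X: (n, d) points; seeds: (k, d) initial centers;
    centroid: relocation rule applied to each cluster's points.
    Monotone convergence holds as long as each relocated center is at
    least as good as the previous one (the variational relaxation).
    """
    C = np.array(seeds, dtype=float)
    for _ in range(iters):
        # Assignment: send each point to its closest center under D
        labels = np.array([np.argmin([D(x, c) for c in C]) for x in X])
        # Relocation: recompute each non-empty cluster's center
        newC = np.array([centroid(X[labels == j]) if np.any(labels == j)
                         else C[j] for j in range(len(C))])
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels

def sqeucl(x, c):
    """Squared Euclidean divergence (a stand-in for D)."""
    return float(np.sum((x - c) ** 2))
```

With two well-separated 1-D groups and one seed in each, the loop converges to the two group means in a single relocation.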
3. The trick of mixed divergences [13, 12]: Dual centroids per cluster
4. Mixed divergences [12]
Defined on three parameters p, q and r:

M_λ(p : q : r) := λ D(p : q) + (1 − λ) D(q : r), for λ ∈ [0, 1].
Mixed divergences include:
the sided divergences for λ ∈ {0, 1},
the symmetrized (arithmetic mean) divergence for λ = 1/2,
the skew symmetrized divergences for λ ∈ (0, 1), λ ≠ 1/2.
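A minimal sketch of the definition, taking the extended Kullback-Leibler divergence as the base divergence D (my choice for illustration); the λ ∈ {0, 1} and λ = 1/2 cases recover the sided and symmetrized divergences:

```python
import numpy as np

def kl(p, q):
    """Extended Kullback-Leibler divergence between positive arrays."""
    return float(np.sum(p * np.log(p / q) - p + q))

def mixed(lam, p, q, r, D=kl):
    """M_lambda(p : q : r) = lam * D(p : q) + (1 - lam) * D(q : r)."""
    return lam * D(p, q) + (1 - lam) * D(q, r)
```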
5. Symmetrizing α-divergences
S_α(p, q) = (1/2) (D_α(p : q) + D_α(q : p)) = S_{−α}(p, q) = M_{1/2}(p : q : p)
For α = ±1, we get half of the Jeffreys divergence:

S_{±1}(p, q) = (1/2) Σ_{i=1}^d (p^i − q^i) log(p^i / q^i)

(same formula for probability and positive measures).
Centroids for the symmetrized α-divergence are usually not in closed form.
How to perform center-based clustering without closed-form centroids?
6. Closed-form formula for Jeffreys positive centroid [7]
The Jeffreys divergence is the symmetrized α = ±1 divergence.
The Jeffreys positive centroid c = (c^1, ..., c^d) of a set {h_1, ..., h_n} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

c^i = a^i / W((a^i / g^i) e)

where a^i = Σ_{j=1}^n π_j h_j^i denotes the coordinate-wise weighted arithmetic mean and g^i = Π_{j=1}^n (h_j^i)^{π_j} the coordinate-wise weighted geometric mean.
The Lambert analytic function W (principal branch) is defined by W(x) e^{W(x)} = x for x ≥ 0.
→ Jeffreys k-means clustering. But for α ≠ ±1, how to cluster?
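The closed-form centroid can be sketched with SciPy's Lambert W (array shapes and function names are mine); the second assertion checks that the formula indeed beats both the arithmetic and geometric means on the weighted Jeffreys objective:

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, w):
    """Closed-form Jeffreys centroid of weighted positive histograms.

    H: (n, d) positive histograms, w: (n,) weights summing to 1.
    Per the slide, c^i = a^i / W((a^i / g^i) e), with a/g the
    coordinate-wise weighted arithmetic/geometric means and W the
    principal branch of the Lambert function.
    """
    a = w @ H                          # weighted arithmetic means per bin
    g = np.exp(w @ np.log(H))          # weighted geometric means per bin
    return a / lambertw(np.e * a / g).real
```

As a sanity check, identical histograms give back the histogram itself, since a = g and W(e) = 1.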
7. Mixed α-divergences/α-Jeffreys symmetrized divergence
Mixed α-divergence between a histogram x and two histograms p and q:

M_{λ,α}(p : x : q) = λ D_α(p : x) + (1 − λ) D_α(x : q)
                   = λ D_{−α}(x : p) + (1 − λ) D_{−α}(q : x)
                   = M_{1−λ,−α}(q : x : p)

The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:

S_α(p, q) = M_{1/2,α}(q : p : q) = M_{1/2,α}(p : q : p)

The skew symmetrized α-divergence is defined by:

S_{λ,α}(p : q) = λ D_α(p : q) + (1 − λ) D_α(q : p)
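A sketch checking the reference-duality M_{λ,α}(p : x : q) = M_{1−λ,−α}(q : x : p) numerically; it uses one standard form of the α-divergence for positive arrays (α ≠ ±1), so treat the exact normalization as an assumption rather than the slide's definition:

```python
import numpy as np

def alpha_div(alpha, p, q):
    """alpha-divergence between positive arrays (alpha not in {-1, +1});
    this form satisfies the duality D_alpha(p : q) = D_{-alpha}(q : p)."""
    return (4.0 / (1.0 - alpha**2)) * float(np.sum(
        (1 - alpha) / 2 * p + (1 + alpha) / 2 * q
        - p**((1 - alpha) / 2) * q**((1 + alpha) / 2)))

def mixed_alpha(lam, alpha, p, x, q):
    """M_{lam,alpha}(p : x : q) = lam*D_a(p : x) + (1-lam)*D_a(x : q)."""
    return lam * alpha_div(alpha, p, x) + (1 - lam) * alpha_div(alpha, x, q)
```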
8. Mixed divergence-based k-means clustering
Initially, take k distinct seeds from the dataset with l_i = r_i.
Input: Weighted histogram set H, divergence D(·,·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds C = {(l_i, r_i)}_{i=1}^k;
repeat
  // Assignment (as usual)
  for i = 1, 2, ..., k do
    C_i ← {h ∈ H : i = argmin_j M_λ(l_j : h : r_j)};
  end
  // Dual-sided centroid relocation (the trick!)
  for i = 1, 2, ..., k do
    r_i ← argmin_x D(C_i : x) = Σ_{h∈C_i} w_h D(h : x);
    l_i ← argmin_x D(x : C_i) = Σ_{h∈C_i} w_h D(x : h);
  end
until convergence;
9. Mixed α-hard clustering: MAhC(H, k, λ, α)
Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R;
Let C = {(l_i, r_i)}_{i=1}^k ← MAS(H, k, λ, α);
repeat
  // Assignment
  for i = 1, 2, ..., k do
    A_i ← {h ∈ H : i = argmin_j M_{λ,α}(l_j : h : r_j)};
  end
  // Centroid relocation
  for i = 1, 2, ..., k do
    r_i ← (Σ_{h∈A_i} w_h h^{(1−α)/2})^{2/(1−α)};
    l_i ← (Σ_{h∈A_i} w_h h^{(1+α)/2})^{2/(1+α)};
  end
until convergence;
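The relocation step's closed-form sided centroids are coordinate-wise weighted power means; a sketch (names mine; α ≠ ±1 so the exponents are finite):

```python
import numpy as np

def sided_alpha_centroids(A, w, alpha):
    """Closed-form sided centroids of the relocation step (alpha != ±1).

    A: (n, d) positive histograms of one cluster, w: (n,) normalized
    weights. r and l are coordinate-wise weighted power means of orders
    (1 - alpha)/2 and (1 + alpha)/2 respectively.
    """
    r = (w @ A**((1 - alpha) / 2))**(2 / (1 - alpha))
    l = (w @ A**((1 + alpha) / 2))**(2 / (1 + alpha))
    return l, r
```

At α = 0 the two sided centroids coincide; at α = 3 the right-sided centroid degenerates to the harmonic mean.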
10. Coupled k-Means++ α-Seeding (extending k-means++)
Algorithm 1: Mixed α-seeding; MAS(H, k, λ, α)
Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R;
Let C ← {(h_j, h_j)} for a histogram h_j chosen with uniform probability;
for i = 2, 3, ..., k do
  Pick at random a histogram h ∈ H with probability:

    π_H(h) := w_h M_{λ,α}(c̄_h : h : c_h) / Σ_{y∈H} w_y M_{λ,α}(c̄_y : y : c_y),   (1)

  // where (c̄_h, c_h) := argmin_{(z̄,z)∈C} M_{λ,α}(z̄ : h : z);
  C ← C ∪ {(h, h)};
end
Output: Set of initial cluster centers C;
→ Guaranteed probabilistic bound. Just need to initialize! No centroid computations, as iterations are not theoretically required.
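A runnable sketch of the coupled seeding (the generic div argument is a stand-in for D_α, and all names are mine); each picked histogram is duplicated into a (left, right) pair, and the next seed is drawn proportionally to its weighted mixed divergence to the closest pair already chosen:

```python
import numpy as np

def mixed_seeding(H, w, k, lam, div, rng=0):
    """Sketch of the coupled ++-style alpha-seeding (MAS)."""
    rng = np.random.default_rng(rng)
    n = len(H)
    C = [(H[rng.integers(n)],) * 2]          # first (l, r) pair, uniform
    M = lambda l, h, r: lam * div(l, h) + (1 - lam) * div(h, r)
    for _ in range(1, k):
        # weighted mixed divergence to the closest chosen pair
        cost = np.array([w[i] * min(M(l, h, r) for (l, r) in C)
                         for i, h in enumerate(H)])
        h = H[rng.choice(n, p=cost / cost.sum())]
        C.append((h, h))
    return C
```

Already-chosen seeds have zero cost, hence zero pick probability, so with k equal to the number of distinct histograms every histogram is selected exactly once.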
12. Learning MMs: A geometric hard clustering viewpoint
Learn the parameters of a mixture m(x) = Σ_{i=1}^k w_i p(x|θ_i).
Maximize the complete data likelihood = clustering objective function:

max_{W,Λ} l_c(W, Λ) = Σ_{i=1}^n Σ_{j=1}^k z_{i,j} log(w_j p(x_i|θ_j))
                    = max_Λ Σ_{i=1}^n max_{j∈[k]} log(w_j p(x_i|θ_j))
                    ≡ min_{W,Λ} Σ_{i=1}^n min_{j∈[k]} D_j(x_i),

where c_j = (w_j, θ_j) (cluster prototype) and D_j(x_i) = −log p(x_i|θ_j) − log w_j are potential distance-like functions.
⇒ We can further attach to each cluster (mixture component) a different family of probability distributions.
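To make the potential functions concrete, a sketch with univariate Gaussian components (a stand-in family, not prescribed by the slide; names are mine); note how the −log w_j term breaks ties in favor of heavier components:

```python
import numpy as np

def potentials(x, weights, mus, sigmas):
    """Distance-like functions D_j(x) = -log p(x | theta_j) - log w_j
    for univariate Gaussian components."""
    log_p = (-0.5 * np.log(2 * np.pi * sigmas**2)
             - (x - mus)**2 / (2 * sigmas**2))
    return -log_p - np.log(weights)

def hard_assign(x, weights, mus, sigmas):
    """Assign x to the component with the smallest potential."""
    return int(np.argmin(potentials(x, weights, mus, sigmas)))
```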
13. Generalized k-MLE: learning statistical EF mixtures [?, 16, 15, 1, 8]
Model-based clustering, assignment of points to clusters:

D_{w_j,θ_j,F_j}(x) = −log p_{F_j}(x; θ_j) − log w_j

k-GMLE:
1. Initialize the weights W ∈ Δ_k and a family type (F_1, ..., F_k) for each cluster.
2. Solve min_Λ Σ_i min_j D_j(x_i) (center-based clustering for W fixed) with potential functions:
   D_j(x_i) = −log p_{F_j}(x_i|θ_j) − log w_j
3. Solve the family types maximizing the MLE in each cluster C_j by choosing the parametric family of distributions F_j = F(γ_j) that yields the best likelihood:
   min_{F_1=F(γ_1), ..., F_k=F(γ_k) ∈ F(γ)} Σ_i min_j D_{w_j,θ_j,F_j}(x_i),
   ∀l, γ_l = argmax_j F*_j(η̂_l = (1/n_l) Σ_{x∈C_l} t_j(x)) + (1/n_l) Σ_{x∈C_l} k(x).
4. Update the weights W as the cluster point proportions.
5. Test for convergence, and go to step 2 otherwise.
Drawback: a biased, non-consistent estimator due to the Voronoi partition.
17. Total Bregman divergences
Conformal divergence with conformal factor ρ:

D′(p : q) = ρ(p, q) D(p : q)

ρ plays the role of a "regularizer" [17] and ensures robustness.
Invariance by rotation of the axes of the design space:

tB(p : q) = B(p : q) / √(1 + ⟨∇F(q), ∇F(q)⟩) = ρ_B(q) B(p : q),

ρ_B(q) = 1 / √(1 + ⟨∇F(q), ∇F(q)⟩).

Total squared Euclidean divergence:

tE(p, q) = (1/2) ⟨p − q, p − q⟩ / √(1 + ⟨q, q⟩).
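A sketch checking that the generic total Bregman formula specializes to the total squared Euclidean divergence for the generator F(x) = ⟨x, x⟩/2 (function names are mine):

```python
import numpy as np

def total_bregman(p, q, F, gradF):
    """tB(p : q) = B(p : q) / sqrt(1 + <gradF(q), gradF(q)>)."""
    B = F(p) - F(q) - gradF(q) @ (p - q)   # ordinary Bregman divergence
    g = gradF(q)
    return B / np.sqrt(1 + g @ g)

def total_sq_euclidean(p, q):
    """Closed form for the generator F(x) = <x, x> / 2."""
    d = p - q
    return 0.5 * (d @ d) / np.sqrt(1 + q @ q)
```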
18. Total Jensen divergence: Illustration of the principle
[Figure: construction of the skew Jensen divergence J_α(p : q) as the vertical gap between the interpolated chord point (F(p)F(q))_α and the graph point F((pq)_α), and of the total Jensen divergence tJ_α(p : q) as its orthogonal projection; the same construction on rotated points (p′, q′) illustrates invariance under rotation of the design space.]
19. Total Jensen divergences
tB(p : q) = ρ_B(q) B(p : q),  ρ_B(q) = 1 / √(1 + ⟨∇F(q), ∇F(q)⟩)

tJ_α(p : q) = ρ_J(p, q) J_α(p : q),  ρ_J(p, q) = 1 / √(1 + (F(p) − F(q))² / ⟨p − q, p − q⟩)

Jensen-Shannon divergence (its square root is a metric [3]):

JS(p, q) = (1/2) Σ_{i=1}^d p^i log(2p^i / (p^i + q^i)) + (1/2) Σ_{i=1}^d q^i log(2q^i / (p^i + q^i))
Lemma
The square root of the total Jensen-Shannon divergence is not a
metric.
20. Total Jensen divergences/Total Bregman divergences
Total Jensen is not a generalization of total Bregman.
In the limit cases α ∈ {0, 1}, we have:

lim_{α→0} tJ_α(p : q) = ρ_J(p, q) B(p : q) ≠ ρ_B(q) B(p : q),
lim_{α→1} tJ_α(p : q) = ρ_J(p, q) B(q : p) ≠ ρ_B(p) B(q : p),

since the conformal factors differ: ρ_J(p, q) ≠ ρ_B(q).
21. Conformal factor from mean value theorem
When p ≃ q, ρ_J(p, q) ≃ ρ_B(q), and the total Jensen divergence tends to the total Bregman divergence for any value of α.
By the mean value theorem,

ρ_J(p, q) = 1 / √(1 + ⟨∇F(ξ), ∇F(ξ)⟩) = ρ_B(ξ),

for some ξ ∈ [p, q].
For univariate generators, the value of ξ is explicit:

ξ = (F′)^{−1}(ΔF/Δ) = (F*)′(ΔF/Δ),

with ΔF = F(p) − F(q) and Δ = p − q, where F* is the Legendre convex conjugate [9].
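The mean-value identity can be checked numerically for a univariate generator; here F(x) = x log x − x, so F′ = log and (F*)′ = exp (my example choice, with names mine):

```python
import numpy as np

def rho_J(p, q, F):
    """Total Jensen conformal factor for univariate p, q."""
    return 1.0 / np.sqrt(1.0 + ((F(p) - F(q)) / (p - q))**2)

def rho_B(x, Fprime):
    """Total Bregman conformal factor at x."""
    return 1.0 / np.sqrt(1.0 + Fprime(x)**2)

# Shannon negentropy-style generator: F(x) = x log x - x, F'(x) = log x,
# hence (F*)'(y) = exp(y) and the mean-value point xi = exp(DeltaF/Delta).
F = lambda x: x * np.log(x) - x
p, q = 2.0, 5.0
xi = np.exp((F(p) - F(q)) / (p - q))
```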
22. Centroids and statistical robustness
Centroids (barycenters) are minimizers of average (weighted) divergences:

L(x; w) = Σ_{i=1}^n w_i tJ_α(p_i : x),  c_α = argmin_{x∈X} L(x; w).

Is it unique?
Is it robust to outliers [4]?
Iterative concave-convex procedure (CCCP) [9]
23. Clustering: No closed-form centroid, no cry!
k-means++ [2] picks seeds randomly: no centroid calculation needed.
Algorithm 2: Total Jensen k-means++ seeding
Input: Number of clusters k ≥ 1;
Let C ← {h_j} with uniform probability;
for i = 2, 3, ..., k do
  Pick at random h ∈ H with probability:

    π_H(h) = tJ_α(c_h : h) / Σ_{y∈H} tJ_α(c_y : y),

  where c_h = argmin_{z∈C} tJ_α(z : h);
  C ← C ∪ {h};
end
Output: Set of initial cluster centers C;
24. Total Jensen divergences: Recap
Total Jensen divergence = conformal divergence with a non-separable, double-sided conformal factor.
Invariant to axis rotation of the "design space".
Equivalent to total Bregman divergences [17, 5] only when p ≃ q.
The square root of the Jensen-Shannon divergence is a metric [3], but the square root of the total JS divergence is not.
Total Jensen k-means++ does not require centroid computations and has a guaranteed probabilistic approximation bound.
Interest of conformal divergences in SVMs [18] (double-sided separable) and in information geometry [14] (flattening).
25. Novel heuristics for NP-hard center-based clustering: merge-and-split and (k, l)-means [11]
26. The k-means merge-and-split heuristic
Generalizes Hartigan's single-point relocation heuristic...
Consider pairs of clusters (C_i, C_j) with centers c_i and c_j; merge them and split them again into two clusters with new centers c′_i and c′_j. Accept when the sum of the two cluster variances decreases:

Δ(C_i, C_j) = V(C_i, c_i) + V(C_j, c_j) − (V(C′_i, c′_i) + V(C′_j, c′_j))

How to split the two merged clusters again (the best splitting is NP-hard)?
Discrete 2-means: choose among the n_{i,j} = n_i + n_j points of C_{i,j} the two best centers (naively implemented in O(n_{i,j}³)). This yields a 2-approximation of 2-means.
2-means++ heuristic: pick c′_i at random, then pick c′_j randomly according to the normalized distribution of the squared distances of the points in C_{i,j} to c′_i (see k-means++). We repeat this initialization a given number α of rounds (say, α = 1 + 0.01 (n_{i,j} choose 2)) and keep the best one.
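The deterministic discrete 2-means split can be sketched as (names mine); enumerating all pairs of points as candidate centers costs O(n³) with the naive per-pair assignment:

```python
import numpy as np

def discrete_two_means(P):
    """Exhaustive discrete 2-means: choose the best pair of data points
    as centers; a 2-approximation of the optimal 2-means split."""
    best_cost, best_pair = np.inf, None
    for i in range(len(P)):
        for j in range(i + 1, len(P)):
            # assign every point to the nearer of the two candidate centers
            cost = np.minimum(((P - P[i])**2).sum(1),
                              ((P - P[j])**2).sum(1)).sum()
            if cost < best_cost:
                best_cost, best_pair = cost, (P[i], P[j])
    return best_cost, best_pair
```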
27. The k-means merge-and-split heuristic
#ops = number of pivot operations

Data set                    | Hartigan         | Discrete Hartigan | Merge&Split
                            | cost     #ops    | cost      #ops    | cost     #ops
Iris (d=4, n=150, k=3)      | 112.35   35.11   | 101.69    33.54   | 83.95    31.36
Wine (d=13, n=178, k=3)     | 607303   97.88   | 593319    100.02  | 570283   100.47
Yeast (d=8, n=1484, k=10)   | 47.10    1364.0  | 57.34     807.83  | 50.20    190.58

Data set                    | Hartigan++       | Discrete Hartigan++ | Merge&Split++
                            | cost     #ops    | cost      #ops      | cost     #ops
Iris (d=4, n=150, k=3)      | 101.49   19.40   | 90.48     18.93     | 88.56    8.84
Wine (d=13, n=178, k=3)     | 3152616  18.76   | 2525803   24.61     | 2498107  9.67
Yeast (d=8, n=1484, k=10)   | 47.41    1192.38 | 54.96     640.89    | 51.82    66.30
28. The (k, l)-means heuristic: navigating on the local minima!
Associate each p_i to its l nearest cluster centers NN_l(p_i; K) (with iNN_l = the corresponding cluster center indexes), and minimize the (k, l)-means objective function (with 1 ≤ l ≤ k):

e(P, K; l) = Σ_{i=1}^n Σ_{a∈iNN_l(p_i;K)} ‖p_i − c_a‖².
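The objective above can be sketched as (names mine); l = 1 recovers the ordinary k-means cost, and l = k charges every point-center pair:

```python
import numpy as np

def kl_means_cost(P, K, l):
    """(k, l)-means objective: each point pays the squared distances to
    its l nearest cluster centers."""
    D = ((P[:, None, :] - K[None, :, :])**2).sum(-1)   # (n, k) sq. dists
    return float(np.sort(D, axis=1)[:, :l].sum())
```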
Assignment/relocation guarantees a monotone decrease.
Higher l changes the local optima of the optimization landscape; two ways to convert back to a k-means:
(k, l)↓-means: convert a (k, l)-means by assigning each point p_i to its closest center (among the l assigned at the end of the (k, l)-means), then compute the centroids and launch a regular Lloyd k-means to finalize.
Cascading conversion of (k, l)-means to k-means: after convergence of the (k, l)-means, initialize a (k, l − 1)-means by dropping for each point p_i its farthest cluster and perform a Lloyd (k, l − 1)-means, etc., until we get a (k, 1)-means = k-means.
29. The (k, l)-means heuristic: 10000 trials
Data-set: Iris
k  | win % | k-means        | (k, 2)↓-means
   |       | min     avg    | min     avg
3  | 20.8  | 78.94   92.39  | 78.94   78.94
4  | 24.29 | 57.31   63.15  | 57.31   70.33
5  | 57.76 | 46.53   52.88  | 49.74   51.10
6  | 80.55 | 38.93   45.60  | 38.93   41.63
7  | 76.67 | 34.18   40.00  | 34.29   36.85
8  | 80.36 | 29.87   36.05  | 29.87   32.52
9  | 78.85 | 27.76   32.91  | 27.91   30.15
10 | 79.88 | 25.81   30.24  | 25.97   28.02

k  l | win % | k-means        | (k, l)-means
     |       | min     avg    | min     avg
5  2 | 58.3  | 46.53   52.72  | 49.74   51.24
5  4 | 62.4  | 46.53   52.55  | 49.74   49.74
8  2 | 80.8  | 29.87   36.40  | 29.87   32.54
8  3 | 61.1  | 29.87   36.19  | 32.76   34.04
8  6 | 55.5  | 29.88   36.189 | 32.75   35.26
10 2 | 78.8  | 25.81   30.61  | 25.97   28.23
10 3 | 82.5  | 25.95   30.23  | 26.47   27.76
10 5 | 64.7  | 25.90   30.32  | 26.99   28.61
On average the heuristic yields a better cost, but the best local minima are found by the normal k-means...
31. Bibliography I
Christophe Saint-Jean and Frank Nielsen.
Hartigan's method for k-MLE: Mixture modeling with Wishart distributions and its application to motion retrieval.
In Frank Nielsen, editor, Geometric Theory of Information, Signals and Communication Technology, pages 301-330. Springer International Publishing, 2014.
David Arthur and Sergei Vassilvitskii.
k-means++: the advantages of careful seeding.
In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
1027–1035. Society for Industrial and Applied Mathematics, 2007.
Bent Fuglede and Flemming Topsoe.
Jensen-Shannon divergence and Hilbert space embedding.
In IEEE International Symposium on Information Theory, pages 31–31, 2004.
F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel.
Robust Statistics: The Approach Based on Influence Functions.
Wiley Series in Probability and Mathematical Statistics, 1986.
Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407-2419, 2012.
Frank Nielsen.
k-MLE: A fast algorithm for learning statistical mixture models.
CoRR, abs/1203.5181, 2012.
preliminary version in ICASSP.
Frank Nielsen.
Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation
for frequency histograms.
IEEE Signal Processing Letters, 20(7):657-660, 2013.
32. Bibliography II
Frank Nielsen.
On learning statistical mixtures maximizing the complete likelihood.
Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014),
1641:238–245, 2014.
Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids.
IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
Frank Nielsen and Richard Nock.
Sided and symmetrized Bregman centroids.
IEEE Transactions on Information Theory, 55(6):2882-2904, 2009.
Frank Nielsen and Richard Nock.
Further heuristics for k-means: The merge-and-split heuristic and the (k, l)-means.
arXiv preprint arXiv:1406.6314, 2014.
Frank Nielsen, Richard Nock, and Shun-ichi Amari.
On clustering histograms with k-means by using mixed α-divergences.
Entropy, 16(6):3273–3301, 2014.
Richard Nock, Panu Luosto, and Jyrki Kivinen.
Mixed Bregman clustering with approximation guarantees.
In Machine Learning and Knowledge Discovery in Databases, pages 154–169. Springer, 2008.
Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari.
A dually flat structure on the space of escort distributions.
Journal of Physics: Conference Series, 201(1):012012, 2010.
33. Bibliography III
Olivier Schwander and Frank Nielsen.
Fast learning of gamma mixture models with k-MLE.
In Similarity-Based Pattern Recognition, pages 235–249. Springer, 2013.
Olivier Schwander, Aurelien J Schutz, Frank Nielsen, and Yannick Berthoumieu.
k-MLE for mixtures of generalized Gaussians.
In 21st International Conference on Pattern Recognition (ICPR), pages 2825-2828. IEEE, 2012.
Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen.
Total Bregman divergence and its applications to DTI analysis.
IEEE Transactions on Medical Imaging, pages 475–483, 2011.
Si Wu and Shun-ichi Amari.
Conformal transformation of kernel functions: a data-dependent way to improve support vector machine classifiers.
Neural Processing Letters, 15(1):59–67, 2002.