1. The Noise Cluster Model:
a Greedy Solution to the Network Communities
Extraction Problem
Etienne Côme,
come@inrets.fr
&
Eustache Diemert,
ediemert@bestofmedia.com
6 October 2010
Côme & Diemert (INRETS, BestOfMedia)
The Noise Cluster Model 6 October 2010 1 / 32
2. Outline
1 Introduction
2 Existing solutions for the community extraction problem
3 Background on Erdős-Rényi mixture
4 The noise cluster model
5 Community extraction using the noise cluster model
6 Preliminary experiments : Blogs communities extraction
7 Conclusion & future works
3. Introduction
Introduction
Motivations
Extract one community using seed nodes from that community
Online algorithm (does not store the full graph)
Solution : Community extraction
extract one community
semi-supervised method : some community members are known
Solution : Noise cluster model
simple generative model
one community surrounded by noise
4. Introduction
Introduction, (toy example)
8. Introduction
Introduction, (community extraction)
Useful
Useless
9. Introduction
Advantages
seeds focus the processing of the graph
better complexity
exploration of the full graph can be avoided
no problem of balancing community sizes
10. Existing solutions for the community extraction problem
Existing solutions for the community extraction problem
Bagrow & Bollt [BB05]
grow a breadth-first tree outward from one seed node;
stop when the rate of expansion (i.e. the proportion of edges found at
the current level which lead to yet unknown nodes) falls below an
arbitrary threshold.
Problems
can only deal with one seed
all nodes of a level are included
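As a rough illustration of the level-by-level growth described above, here is a minimal Python sketch. It follows the wording of this slide rather than the exact criterion of [BB05]; the function name `grow_from_seed`, the `neighbours` callback and the toy graph are assumptions made for the example.

```python
def grow_from_seed(neighbours, seed, threshold):
    """Grow breadth-first levels from one seed until expansion slows down.

    neighbours: hypothetical callback returning the neighbours of a node.
    threshold:  minimum proportion of edges of the current level that must
                lead to yet unknown nodes for the next level to be added.
    """
    known = {seed}
    level = [seed]
    while level:
        edges = 0          # edges seen from the current level
        leading_out = 0    # those that reach a yet unknown node
        next_level = set()
        for node in level:
            for nb in neighbours(node):
                edges += 1
                if nb not in known:
                    leading_out += 1
                    next_level.add(nb)
        if edges == 0 or leading_out / edges < threshold:
            break
        known |= next_level      # the whole level is included at once
        level = sorted(next_level)
    return known


# Toy graph: a 4-clique {0, 1, 2, 3} attached to a chain 3 - 4 - 5.
graph = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3],
         3: [0, 1, 2, 4], 4: [3, 5], 5: [4]}
community = grow_from_seed(lambda n: graph[n], seed=0, threshold=0.5)
```

On this toy graph the growth stops after the first level, because only one of the ten edges seen from nodes 1, 2 and 3 leads to a new node. The sketch also makes the second problem visible: a level is accepted or rejected as a whole.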
11. Existing solutions for the community extraction problem
Existing solutions for the community extraction problem
Clauset [Cla05]
greedy optimization of a quantity called local modularity $L_{mod}$;
boundary $B$: the subset of known nodes that have at least one
neighbour in the set of yet unknown nodes;
local modularity: number of edges between this set and the set of
known nodes $C$ over the total number of edges with one extremity in
this set.
\[ L_{mod} = \frac{\sum_{i \in C, j \in B} B_{ij} + \sum_{i \in B, j \in C} B_{ij}}{\sum_{i,j} B_{ij}}, \tag{1} \]
with $B_{ij} = 1$ if $i$ and $j$ are connected and either vertex is in $B$.
Problems
can only deal with one seed
the stopping criterion requires tuning
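To make the definition concrete, the following Python sketch computes a local modularity for a given set of known nodes. It uses the reading given in the prose above (each undirected edge counted once); the function name `local_modularity`, the dictionary-of-sets graph representation and the convention of returning 0.0 for an empty boundary are assumptions of this example, not part of [Cla05].

```python
def local_modularity(adjacency, community):
    """Share of boundary edges that stay inside the known set.

    adjacency: dict mapping each node to the set of its neighbours
               (undirected graph, hypothetical representation).
    community: iterable of known nodes (the set C of the slide).
    """
    community = set(community)
    # Boundary B: known nodes with at least one yet unknown neighbour.
    boundary = {v for v in community
                if any(nb not in community for nb in adjacency[v])}
    # Undirected edges with at least one extremity in the boundary.
    touching = set()
    for v in boundary:
        for nb in adjacency[v]:
            touching.add(frozenset((v, nb)))
    if not touching:
        return 0.0  # convention chosen for this sketch
    internal = sum(1 for edge in touching if edge <= community)
    return internal / len(touching)


# Triangle {0, 1, 2} with a tail 2 - 3 - 4.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
score = local_modularity(adjacency, {0, 1, 2})
```

Here the boundary is the single node 2, which has three edges, two of which stay inside the known set, so the score is 2/3.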
12. Existing solutions for the community extraction problem
Existing solutions for the community extraction problem
Other solutions
[AL06] random walks and conductances
[SG10] combinatorial algorithms
Problems
complexity scales with the graph size
13. Background on Erdős-Rényi mixture
Graph clustering
Generative setting (Erdős-Rényi mixture, block model)
Variable definitions:
$X_{ij}$ are binary variables encoding the presence or absence of a link from node
$i$ to node $j$:
\[ x_{ij} = \begin{cases} 1, & \text{if there is a link from } i \text{ to } j \\ 0, & \text{otherwise.} \end{cases} \tag{2} \]
$Z_{jk}$ are dummy variables encoding cluster membership; they take
values $z_{jk}$:
\[ z_{jk} = \begin{cases} 1, & \text{if } j \text{ belongs to cluster } k \\ 0, & \text{otherwise.} \end{cases} \tag{3} \]
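14. Background on Erdős-Rényi mixture
Erdős-Rényi mixture
Model definition [DPS08]
\[ (Z_{jk})_k \overset{\text{i.i.d.}}{\sim} \mathcal{M}(1, \gamma), \quad \forall j \in \{1, \dots, N\} \tag{4} \]
\[ X_{ij} \mid Z_{ik} \times Z_{jl} = 1 \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\pi_{kl}), \quad \forall i, j \in \{1, \dots, N\} \tag{5} \]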
15. Background on Erdős-Rényi mixture
Erdős-Rényi mixture
Figure: Adjacency matrix simulated using an Erdős-Rényi mixture.
16. The noise cluster model
The noise cluster model
Model definition
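\[ Z_i \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\gamma), \quad \forall i \in \{1, \dots, N\}, \tag{6} \]
\[ X_{ij} \mid Z_i \times Z_j = 1 \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\alpha), \quad \forall i, j \in \{1, \dots, N\}, \tag{7} \]
\[ X_{ij} \mid Z_i \times Z_j = 0 \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\beta), \quad \forall i, j \in \{1, \dots, N\}, \tag{8} \]
with $z_i = 1$ if $i$ belongs to the community and $0$ otherwise.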
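The generative definition can be read as a sampling recipe. The sketch below draws memberships and a directed adjacency matrix from the model using only the Python standard library; the function name `simulate_noise_cluster`, the list-of-lists matrix and the exclusion of self-links are assumptions of this example.

```python
import random


def simulate_noise_cluster(n, alpha, beta, gamma, seed=0):
    """Sample (z, x) from the noise cluster model with n nodes."""
    rng = random.Random(seed)
    # (6): each node joins the community with probability gamma.
    z = [1 if rng.random() < gamma else 0 for _ in range(n)]
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-links are left out (assumption)
            # (7) and (8): alpha inside the community, beta everywhere else.
            p = alpha if z[i] * z[j] == 1 else beta
            x[i][j] = 1 if rng.random() < p else 0
    return z, x


z, x = simulate_noise_cluster(n=200, alpha=0.1, beta=0.001, gamma=0.05, seed=1)
community_size = sum(z)
```

Sorting the rows and columns of `x` by `z` gives the picture suggested by the model: one denser block for the community, surrounded by sparse noise.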
17. The noise cluster model
The noise cluster model
Figure: Adjacency matrix simulated using the noise cluster model.
18. The noise cluster model
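Basic quantities
Community size:
\[ N_c = \sum_i z_i \]
Node degrees:
\[ d_j^{in} = \sum_{i : z_i = 1} x_{ij}, \quad d_j^{out} = \sum_{i : z_i = 1} x_{ji}, \quad d_j = \sum_{i : z_i = 1} (x_{ij} + x_{ji}) \]
Posterior probabilities:
\[ p_j^{in} = P(Z_j = 1 \mid X_{ij} = x_{ij}, Z_i = z_i, \forall i \in \{1, \dots, N\}), \]
\[ p_j^{out} = P(Z_j = 1 \mid X_{ji} = x_{ji}, Z_i = z_i, \forall i \in \{1, \dots, N\}), \]
\[ p_j^{in,out} = P(Z_j = 1 \mid X_{ij} = x_{ij}, X_{ji} = x_{ji}, Z_i = z_i, \forall i \in \{1, \dots, N\}), \]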
19. The noise cluster model
Simplifications:
Community membership posterior probabilities are the quantities of
interest to decide whether a node must be added to the community. They
depend only on:
parameters ($\alpha$, $\beta$, $\gamma$);
links with community members ($d_j^{in}$, $d_j^{out}$, $d_j$ respectively);
community size ($N_c$).
Example for $p_j^{in}$:
\[ p_j^{in} = \frac{\alpha^{d_j^{in}} (1 - \alpha)^{N_c - d_j^{in}} \gamma}{\alpha^{d_j^{in}} (1 - \alpha)^{N_c - d_j^{in}} \gamma + \beta^{d_j^{in}} (1 - \beta)^{N_c - d_j^{in}} (1 - \gamma)} \]
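The posterior above only needs three numbers besides the parameters, which the following Python sketch makes explicit. It evaluates the formula in log space to stay stable for large communities; the function name `posterior_in` and the overflow guard are choices made for this example.

```python
import math


def posterior_in(d_in, n_c, alpha, beta, gamma):
    """P(Z_j = 1 | links received from the n_c current community members)."""
    # Logs of the two terms of the denominator, to avoid underflow.
    log_member = (d_in * math.log(alpha)
                  + (n_c - d_in) * math.log(1 - alpha) + math.log(gamma))
    log_noise = (d_in * math.log(beta)
                 + (n_c - d_in) * math.log(1 - beta) + math.log(1 - gamma))
    gap = log_noise - log_member
    if gap > 700:  # exp would overflow; the posterior is numerically 0
        return 0.0
    return 1.0 / (1.0 + math.exp(gap))


# alpha = 0.1, beta = 0.001, gamma = 0.05 and N_c = 200, as in the figure
# of this section.
values = {d: posterior_in(d, 200, 0.1, 0.001, 0.05) for d in (0, 5, 6, 10)}
```

With these parameters the posterior moves from below 0.5 at 5 incoming links to above 0.5 at 6, which is the sharp transition that motivates a simple threshold on the degree.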
20. The noise cluster model
Community membership test
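The community membership test is equivalent to thresholding the number of
links shared with community members. For $\alpha > \beta$:
\[ \{p_j^{in} > s\} \Leftrightarrow \{d_j^{in} > d_{min}\}, \tag{9} \]
with
\[ d_{min} = \frac{\log\left(s \times (1 - \beta)^{N_c} \times (1 - \gamma)\right) - \log\left((1 - s) \times (1 - \alpha)^{N_c} \times \gamma\right)}{\log(\alpha \times (1 - \beta)) - \log((1 - \alpha) \times \beta)} \]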
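The threshold follows from taking logarithms of the inequality between the two terms of the posterior and isolating the degree. A small Python sketch of the resulting formula is given below; the function name `d_min` and its argument order are assumptions, and the code requires 0 < beta < alpha < 1 so that the denominator is positive.

```python
import math


def d_min(alpha, beta, gamma, n_c, s=0.5):
    """Degree threshold of the membership test for a community of size n_c."""
    numerator = (math.log(s) + n_c * math.log(1 - beta) + math.log(1 - gamma)
                 - math.log(1 - s) - n_c * math.log(1 - alpha) - math.log(gamma))
    denominator = (math.log(alpha * (1 - beta))
                   - math.log((1 - alpha) * beta))
    return numerator / denominator


# alpha = 0.1, beta = 0.001, gamma = 0.05, s = 0.5, for two community sizes.
threshold_200 = d_min(0.1, 0.001, 0.05, 200)
threshold_400 = d_min(0.1, 0.001, 0.05, 400)
```

For a community of 200 nodes the threshold is about 5.06, so a node needs at least 6 incoming links from community members to be accepted. The threshold grows with the community size, since a larger community offers more opportunities for noise links.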
21. The noise cluster model
Figure: (top) values of $p_j^{in}$ with respect to $d_j^{in}$, with $\alpha = 0.1$,
$\beta = 0.001$, $\gamma = 0.05$ and $N_c = 200$; (bottom) evolution of $d_{min}$ with respect to the
community size $N_c$, with $\alpha = 0.1$, $\beta = 0.001$, $\gamma = 0.05$ and $s = 0.5$.
22. Community extraction using the noise cluster model
Online learning [ZAM08]
Classification likelihood
In the case of a full adjacency matrix, the classification log-likelihood is
defined as:
\[
\begin{aligned}
L_c(X, Z, \theta) = {} & \sum_i z_i \log(\gamma) + \sum_i (1 - z_i) \log(1 - \gamma) \\
& + \sum_{i,j : i \neq j} z_i z_j x_{ij} \log(\alpha) + \sum_{i,j : i \neq j} z_i z_j (1 - x_{ij}) \log(1 - \alpha) \\
& + \sum_{i,j : i \neq j} (1 - z_i z_j) x_{ij} \log(\beta) + \sum_{i,j : i \neq j} (1 - z_i z_j) (1 - x_{ij}) \log(1 - \beta)
\end{aligned}
\]
with $Z = \{z_1, \dots, z_N\}$, $X = \{x_{ij} : i \neq j,\ i, j \in \{1, \dots, N\}\}$, and
$\theta = (\gamma, \alpha, \beta)$ the parameter vector.
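For a small graph held fully in memory, the classification log-likelihood can be evaluated directly from its definition. The Python sketch below is a literal transcription under that assumption; the function name and the list-of-lists adjacency matrix are illustrative choices.

```python
import math


def classification_log_likelihood(x, z, alpha, beta, gamma):
    """L_c(X, Z, theta) for adjacency matrix x and membership vector z."""
    n = len(z)
    # Membership terms: log(gamma) for members, log(1 - gamma) otherwise.
    total = sum(math.log(gamma) if zi == 1 else math.log(1 - gamma)
                for zi in z)
    # Link terms over all ordered pairs with i != j.
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            p = alpha if z[i] * z[j] == 1 else beta
            total += math.log(p) if x[i][j] == 1 else math.log(1 - p)
    return total


# Two community members with a single link from node 0 to node 1.
x = [[0, 1],
     [0, 0]]
value = classification_log_likelihood(x, [1, 1], alpha=0.5, beta=0.1, gamma=0.5)
```

In this two-node case every term equals log(0.5), so the value is 4 log(0.5). The double loop costs O(N^2), which is why the full-matrix form is only practical as a reference for small graphs.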
23. Community extraction using the noise cluster model
Online learning [ZAM08]
Maximisation for known partition
If the partition $Z = \{z_1, \dots, z_N\}$ is known, with a square adjacency
matrix of size $N \times N$, the parameter vector maximizing the classification
likelihood is given by:
\[ \hat{\gamma} = \frac{N_c}{N}, \tag{10} \]
\[ \hat{\alpha} = \frac{1}{N_c^2} \sum_{i,j=1,\, i \neq j}^{N} (z_i \times z_j)\, x_{ij}, \tag{11} \]
\[ \hat{\beta} = \frac{1}{\bar{N}_c \times (N + N_c)} \sum_{i,j=1,\, i \neq j}^{N} (1 - z_i \times z_j)\, x_{ij}, \tag{12} \]
with $N_c$ the number of nodes belonging to the community and $\bar{N}_c$ the
number of nodes that do not belong to the community.
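The three estimators are simple ratios of counts, as the following Python sketch shows. It transcribes (10), (11) and (12) as written, so the denominators count ordered pairs including the diagonal ($N_c^2$ and $\bar{N}_c (N + N_c) = N^2 - N_c^2$); an exact count of pairs with $i \neq j$ would use $N_c (N_c - 1)$ for the first. The function name `estimate_parameters` is an assumption, and the code requires $0 < N_c < N$.

```python
def estimate_parameters(x, z):
    """Return (gamma_hat, alpha_hat, beta_hat) for a known partition z."""
    n = len(z)
    n_c = sum(z)          # community size
    n_c_bar = n - n_c     # nodes outside the community
    inside = 0            # links between two community members
    outside = 0           # all other links
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if z[i] * z[j] == 1:
                inside += x[i][j]
            else:
                outside += x[i][j]
    gamma_hat = n_c / n
    alpha_hat = inside / n_c ** 2
    beta_hat = outside / (n_c_bar * (n + n_c))
    return gamma_hat, alpha_hat, beta_hat


# Three nodes: 0 and 1 in the community (linked both ways), 2 outside.
x = [[0, 1, 0],
     [1, 0, 1],
     [0, 0, 0]]
gamma_hat, alpha_hat, beta_hat = estimate_parameters(x, [1, 1, 0])
```

On this example the two internal links give an alpha estimate of 2/4 and the single link towards node 2 gives a beta estimate of 1/5.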
24. Community extraction using the noise cluster model
Proposed community extraction procedure
Algorithm
Use a breadth-first algorithm to explore the graph starting from the seeds;
for each traversed vertex:
1 use the community membership test (9) to decide whether to add it to the
community
2 update the parameters (using (10), (11), (12)), taking into account the current
partition
until no more vertices can be added to the community.
25. Preliminary experiments : Blogs communities extraction
Preliminary experiments : Blogs communities extraction
Settings
multi-threaded web crawler coupled with the proposed community
extraction procedure;
seed URLs taken from Wikio (http://www.wikio.com), which
offers several blog rankings for several topics;
these rankings were used to provide 100 or 50 seeds to the algorithm
for 4 test communities.
27. Preliminary experiments : Blogs communities extraction
Blogs community extraction
rank  name                            level
1     www.bouletcorp.com              0
2     louromano.blogspot.com          2
3     www.cartoonbrew.com             2
4     yacinfields.blogspot.com        1
5     polyminthe.blogspot.com         1
6     marnette.canalblog.com          1
7     blackwingdiaries.blogspot.com   2
8     bastienvives.blogspot.com       1
9     donshank.blogspot.com           2
10    john-nevarez.blogspot.com       2
Table: Best sites according to local PageRank for the Comics (fr) community
28. Preliminary experiments : Blogs communities extraction
Figure: Word clouds for Politics (us). The first 50 words in descending order of
their Kullback-Leibler divergence are kept (divergence between the word document
frequency in the community and in a negative class of 10,000 random blogs; texts
were first preprocessed using a stop list and stemming). Word sizes are
proportional to the word document frequencies in the community.
29. Preliminary experiments : Blogs communities extraction
Figure: Word clouds for Food (us). The first 50 words in descending order of
their Kullback-Leibler divergence are kept (divergence between the word document
frequency in the community and in a negative class of 10,000 random blogs; texts
were first preprocessed using a stop list and stemming). Word sizes are
proportional to the word document frequencies in the community.
30. Conclusion & future works
Conclusion & future works
Conclusion
simple, greedy approach;
complexity scales with the community size, not the graph size;
blog community extraction was performed successfully with this
tool.
Future works
More work is needed to better understand and evaluate the approach:
test the robustness of the method to noise in the seed set;
test on other application domains (with ground truth);
test using graph generation algorithms.
31. Conclusion & future works
[AL06] R. Andersen and K. Lang.
Communities from seed sets.
In Proceedings of the 15th International Conference on World Wide Web, pages 223–232.
ACM Press, 2006.
[BB05] J.P. Bagrow and E.M. Bollt.
A local method for detecting communities.
Phys Rev E Stat Nonlin Soft Matter Phys, 72(4):046108, 2005.
[Cla05] A. Clauset.
Finding local community structure in networks.
Phys Rev E Stat Nonlin Soft Matter Phys, 72(2):026132, 2005.
[DPS08] J. Daudin, F. Picard, and S. Robin.
A mixture model for random graphs.
Statistics and Computing, 18:1–36, 2008.
[SG10] M. Sozio and A. Gionis.
The community-search problem and how to plan a successful cocktail party.
In Proceedings of the 16th ACM SIGKDD Conference On Knowledge Discovery and Data
Mining (KDD), pages –, 2010.
[ZAM08] H. Zanghi, C. Ambroise, and V. Miele.
Fast online graph clustering via Erdős-Rényi mixture.
Pattern Recognition, 41(12):3592–3599, December 2008.
32. Conclusion & future works
Thanks for your attention!