Marami 2010

The Noise Cluster Model:
a Greedy Solution to the Network Communities
Extraction Problem

Etienne Côme,
come@inrets.fr
&
Eustache Diemert,
ediemert@bestofmedia.com

6 October 2010

Côme & Diemert (INRETS, BestOfMedia)

Outline

1 Introduction

2 Existing solutions for the community extraction problem

3 Background on Erdős-Rényi mixture

4 The noise cluster model

5 Community extraction using the noise cluster model

6 Preliminary experiments: blog community extraction

7 Conclusion & future work


Introduction


Motivations
Extract one community using seed nodes from that community
Online algorithm (does not store the full graph)

Solution: community extraction
extract one community
semi-supervised method: some community members are known

Solution: noise cluster model
simple generative model
one community surrounded by noise


Introduction (toy example)

Introduction (graph clustering)

Introduction (seeds)

Introduction (community extraction)

Useful

Useless


Advantages

seeds give a focus for processing the graph
better complexity
exploring the full graph can be avoided
no problem of balance between community sizes


Existing solutions for the community extraction problem


Bagrow & Bollt [BB05]
grow a breadth-first tree outward from one seed node;
stop when the rate of expansion (i.e. the proportion of edges found at
the current level which lead to nodes that are still unknown) falls
below an arbitrary threshold.

Problems
can only deal with one seed
all nodes of a level are included



Clauset [Cla05]
greedy optimization of a quantity called local modularity $L_{mod}$;
boundary B: the subset of known nodes that have at least one
neighbour in the set of still unknown nodes;
local modularity: number of edges between this set and the set of
known nodes C over the total number of edges with one extremity in
this set.

\[
L_{mod} = \frac{\sum_{i \in C,\, j \in B} B_{ij} + \sum_{i \in B,\, j \in C} B_{ij}}{\sum_{i,j} B_{ij}}, \tag{1}
\]

with $B_{ij} = 1$ if i and j are connected and either vertex is in B, and 0 otherwise.
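As an illustration of definition (1), here is a minimal Python sketch for an undirected graph. The function name `local_modularity`, the adjacency-dictionary representation and the toy graph are assumptions made for this example; they do not come from [Cla05]. Each undirected edge is counted once, which gives the same ratio as summing over ordered pairs (i, j) in (1).

```python
def local_modularity(adj, known):
    """Local modularity (1) for an undirected graph.

    adj   : dict mapping each node to the set of its neighbours
    known : set C of known nodes
    """
    # Boundary B: known nodes with at least one unknown neighbour.
    boundary = {i for i in known if any(j not in known for j in adj[i])}
    inside = 0   # edges joining a boundary node to another known node
    total = 0    # edges with at least one endpoint in the boundary
    counted = set()
    for i in boundary:
        for j in adj[i]:
            edge = frozenset((i, j))
            if edge in counted:
                continue
            counted.add(edge)
            total += 1
            if j in known:
                inside += 1
    # Convention assumed here: 0.0 when the boundary is empty.
    return inside / total if total else 0.0


# Toy graph (hypothetical): a triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
print(local_modularity(adj, {"a", "b", "c"}))
```

With C = {a, b, c}, the boundary is {c}; two of the three edges touching c stay inside C, so the ratio is 2/3.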

Problems
can only deal with one seed
the stopping criterion needs tuning



Other solutions
[AL06] random walks and conductances
[SG10] combinatorial algorithms

Problems
complexity scales with the graph size


Background on Erdős-Rényi mixture

Graph clustering

Generative setting (Erdős-Rényi mixture, block model)
Variable definitions:
$X_{ij}$ are binary variables encoding the presence or absence of a link from node
i to node j:

\[
x_{ij} = \begin{cases} 1, & \text{if there is a link from } i \text{ to } j \\ 0, & \text{otherwise.} \end{cases} \tag{2}
\]

$Z_{jk}$ are dummy variables encoding cluster membership; they take the
values $z_{jk}$:

\[
z_{jk} = \begin{cases} 1, & \text{if } j \text{ belongs to cluster } k \\ 0, & \text{otherwise.} \end{cases} \tag{3}
\]


Erdős-Rényi mixture

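Model definition [DPS08]

\[
Z_j = (Z_{jk})_k \overset{i.i.d.}{\sim} \mathcal{M}(1, \gamma), \quad \forall j \in \{1, \dots, N\} \tag{4}
\]
\[
X_{ij} \mid Z_{ik} \times Z_{jl} = 1 \overset{i.i.d.}{\sim} \mathcal{B}(\pi_{kl}), \quad \forall i, j \in \{1, \dots, N\} \tag{5}
\]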


Figure: Adjacency matrix simulated using an Erdős-Rényi mixture.


The noise cluster model


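Model definition

\[
Z_i \overset{i.i.d.}{\sim} \mathcal{B}(\gamma), \quad \forall i \in \{1, \dots, N\}, \tag{6}
\]
\[
X_{ij} \mid Z_i \times Z_j = 1 \overset{i.i.d.}{\sim} \mathcal{B}(\alpha), \quad \forall i, j \in \{1, \dots, N\}, \tag{7}
\]
\[
X_{ij} \mid Z_i \times Z_j = 0 \overset{i.i.d.}{\sim} \mathcal{B}(\beta), \quad \forall i, j \in \{1, \dots, N\}, \tag{8}
\]

with $z_i = 1$ if i belongs to the community and 0 otherwise.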
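The generative model (6)-(8) can be simulated in a few lines. The following is a minimal Python sketch, not the authors' code: the function name `simulate_noise_cluster`, the list-of-lists adjacency matrix and the exclusion of self-loops (i = j) are assumptions made for this example.

```python
import random


def simulate_noise_cluster(n, gamma, alpha, beta, seed=0):
    """Draw memberships z and a directed adjacency matrix x from (6)-(8)."""
    rng = random.Random(seed)
    # (6): each node belongs to the community with probability gamma.
    z = [1 if rng.random() < gamma else 0 for _ in range(n)]
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-loops
            # (7): both endpoints in the community -> link probability alpha.
            # (8): any other pair -> link probability beta.
            p = alpha if z[i] * z[j] == 1 else beta
            if rng.random() < p:
                x[i][j] = 1
    return z, x


z, x = simulate_noise_cluster(n=200, gamma=0.05, alpha=0.1, beta=0.001)
print(sum(z), sum(map(sum, x)))  # community size and number of links
```

Sorting the rows and columns of `x` by `z` should give a picture similar to the simulated adjacency matrix: one dense block surrounded by sparse noise.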




Figure: Adjacency matrix simulated using the noise cluster model.



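Basic quantities
Community size:

\[
N_c = \sum_i z_i
\]

Node degrees:

\[
d_j^{in} = \sum_{i : z_i = 1} x_{ij}, \qquad d_j^{out} = \sum_{i : z_i = 1} x_{ji}, \qquad d_j = \sum_{i : z_i = 1} (x_{ij} + x_{ji})
\]

Posterior probabilities:

\[
p_j^{in} = P(Z_j = 1 \mid X_{ij} = x_{ij}, Z_i = z_i, \forall i \in \{1, \dots, N\}),
\]
\[
p_j^{out} = P(Z_j = 1 \mid X_{ji} = x_{ji}, Z_i = z_i, \forall i \in \{1, \dots, N\}),
\]
\[
p_j^{in,out} = P(Z_j = 1 \mid X_{ij} = x_{ij}, X_{ji} = x_{ji}, Z_i = z_i, \forall i \in \{1, \dots, N\}),
\]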



Simplifications:
Community membership posterior probabilities are the quantities of
interest for deciding whether a node must be added to the community. They
depend only on:
parameters (α, β, γ);
links with community members ($d_j^{in}$, $d_j^{out}$, $d_j$ respectively);
community size ($N_c$).

Example for $p_j^{in}$

\[
p_j^{in} = \frac{\alpha^{d_j^{in}} (1 - \alpha)^{N_c - d_j^{in}} \gamma}{\alpha^{d_j^{in}} (1 - \alpha)^{N_c - d_j^{in}} \gamma + \beta^{d_j^{in}} (1 - \beta)^{N_c - d_j^{in}} (1 - \gamma)}
\]
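The posterior above can be evaluated directly. The sketch below is one possible Python implementation; the function name `posterior_in` and the choice to work in log space (to avoid numerical underflow when $N_c$ is large) are assumptions of this example, not part of the slides. It assumes all three parameters lie strictly between 0 and 1.

```python
import math


def posterior_in(d_in, n_c, alpha, beta, gamma):
    """Posterior p_j^in given d_in links received from n_c community members."""
    # Log of the "community" term: alpha^d (1-alpha)^(Nc-d) gamma.
    log_member = (d_in * math.log(alpha)
                  + (n_c - d_in) * math.log(1 - alpha)
                  + math.log(gamma))
    # Log of the "noise" term: beta^d (1-beta)^(Nc-d) (1-gamma).
    log_noise = (d_in * math.log(beta)
                 + (n_c - d_in) * math.log(1 - beta)
                 + math.log(1 - gamma))
    diff = log_noise - log_member
    if diff > 700:  # exp would overflow; the posterior is numerically 0
        return 0.0
    # member / (member + noise) = 1 / (1 + noise / member)
    return 1.0 / (1.0 + math.exp(diff))


for d in range(0, 9):
    print(d, posterior_in(d, n_c=200, alpha=0.1, beta=0.001, gamma=0.05))
```

With α = 0.1, β = 0.001, γ = 0.05 and $N_c$ = 200, the posterior stays below 0.5 up to 5 incoming links and exceeds 0.5 from 6 links onward.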



Community membership test
The community membership test is equivalent to thresholding the number of
links shared with community members:

\[
\{p_j^{in} > s\} \Leftrightarrow \{d_j^{in} > d_{min}\}, \tag{9}
\]

with

\[
d_{min} = \frac{\log\left(s (1 - \beta)^{N_c} (1 - \gamma)\right) - \log\left((1 - s)(1 - \alpha)^{N_c} \gamma\right)}{\log(\alpha (1 - \beta)) - \log((1 - \alpha) \beta)}
\]
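The threshold of test (9) is cheap to compute. Here is a minimal Python sketch; the function name `d_min` is chosen for this example, and the code assumes 0 < β < α < 1, 0 < γ < 1 and 0 < s < 1, so that every logarithm is finite and the denominator is positive.

```python
import math


def d_min(s, n_c, alpha, beta, gamma):
    """Minimum number of links from community members required by test (9)."""
    numerator = (math.log(s) + n_c * math.log(1 - beta) + math.log(1 - gamma)
                 - math.log(1 - s) - n_c * math.log(1 - alpha) - math.log(gamma))
    denominator = math.log(alpha * (1 - beta)) - math.log((1 - alpha) * beta)
    return numerator / denominator


for n_c in (50, 100, 200, 400):
    print(n_c, d_min(0.5, n_c, alpha=0.1, beta=0.001, gamma=0.05))
```

For α = 0.1, β = 0.001, γ = 0.05, s = 0.5 and $N_c$ = 200, the threshold falls between 5 and 6, so a node needs at least 6 links from community members to be accepted. The threshold grows linearly with $N_c$.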



Figure: (top) values of $p_j^{in}$ with respect to $d_j^{in}$ with α = 0.1,
β = 0.001, γ = 0.05 and $N_c$ = 200; (bottom) evolution of $d_{min}$ with respect to the
community size $N_c$ with α = 0.1, β = 0.001, γ = 0.05 and s = 0.5.


Community extraction using the noise cluster model

Online learning [ZAM08]

Classification likelihood
In the case of a full adjacency matrix, the classification log-likelihood is
defined as:

\[
\begin{aligned}
L_c(X, Z, \theta) ={}& \sum_i z_i \log(\gamma) + \sum_i (1 - z_i) \log(1 - \gamma) \\
&+ \sum_{i,j : i \neq j} z_i z_j x_{ij} \log(\alpha) + \sum_{i,j : i \neq j} z_i z_j (1 - x_{ij}) \log(1 - \alpha) \\
&+ \sum_{i,j : i \neq j} (1 - z_i z_j) x_{ij} \log(\beta) + \sum_{i,j : i \neq j} (1 - z_i z_j)(1 - x_{ij}) \log(1 - \beta)
\end{aligned}
\]

with $Z = \{z_1, \dots, z_N\}$, $X = \{x_{ij} : i \neq j,\ i, j \in \{1, \dots, N\}\}$, and
$\theta = (\gamma, \alpha, \beta)$ the parameter vector.
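The classification log-likelihood can be computed term by term from its definition. The Python sketch below assumes a full adjacency matrix stored as a list of lists and parameters strictly between 0 and 1; the function name `classification_log_likelihood` is chosen for this example.

```python
import math


def classification_log_likelihood(x, z, gamma, alpha, beta):
    """L_c(X, Z, theta) for adjacency matrix x and membership vector z."""
    n = len(z)
    # Membership terms: z_i log(gamma) + (1 - z_i) log(1 - gamma).
    ll = sum(zi * math.log(gamma) + (1 - zi) * math.log(1 - gamma) for zi in z)
    # Link terms over ordered pairs i != j.
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # alpha governs pairs inside the community, beta all other pairs.
            p = alpha if z[i] * z[j] == 1 else beta
            ll += x[i][j] * math.log(p) + (1 - x[i][j]) * math.log(1 - p)
    return ll


# Two community members linked in both directions.
print(classification_log_likelihood([[0, 1], [1, 0]], [1, 1], 0.5, 0.5, 0.1))
```

In this toy case the value is 4 log(0.5): two membership terms and two link terms, each equal to log(0.5).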



Online learning [ZAM08]
Maximisation for a known partition
If the partition $Z = \{z_1, \dots, z_N\}$ is known, with a square adjacency
matrix of size N × N, the parameter vector maximizing the classification
likelihood is given by:

\[
\hat{\gamma} = \frac{N_c}{N}, \tag{10}
\]
\[
\hat{\alpha} = \frac{1}{N_c^2} \sum_{i,j=1,\, i \neq j}^{N} (z_i \times z_j)\, x_{ij}, \tag{11}
\]
\[
\hat{\beta} = \frac{1}{\bar{N}_c \times (N + N_c)} \sum_{i,j=1,\, i \neq j}^{N} (1 - z_i \times z_j)\, x_{ij}, \tag{12}
\]

with $N_c$ the number of nodes belonging to the community and $\bar{N}_c$ the
number of nodes that do not belong to the community.
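Estimators (10)-(12) translate directly into code. The following Python sketch assumes a full adjacency matrix as a list of lists; the function name `estimate_parameters` and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def estimate_parameters(x, z):
    """Return (gamma_hat, alpha_hat, beta_hat) following (10), (11), (12)."""
    n = len(z)
    n_c = sum(z)        # nodes inside the community
    n_bar = n - n_c     # nodes outside the community
    links_in = 0        # links between two community members
    links_out = 0       # all other links
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if z[i] * z[j] == 1:
                links_in += x[i][j]
            else:
                links_out += x[i][j]
    gamma_hat = n_c / n
    # Assumed convention: 0.0 when the corresponding set of pairs is empty.
    alpha_hat = links_in / n_c ** 2 if n_c else 0.0
    beta_hat = links_out / (n_bar * (n + n_c)) if n_bar else 0.0
    return gamma_hat, alpha_hat, beta_hat


x = [[0, 1, 0],
     [1, 0, 1],
     [0, 0, 0]]
print(estimate_parameters(x, [1, 1, 0]))
```

Here N = 3 and $N_c$ = 2: the two members are linked in both directions, so α̂ = 2/4 = 0.5, and the single remaining link gives β̂ = 1/(1 × 5) = 0.2.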


Proposed community extraction procedure

Algorithm
Use a breadth-first algorithm to explore the graph starting from the seeds;
for each traversed vertex:
1 use the community membership test (9) to decide whether to add it to the
community;
2 update the parameters (using (10), (11) and (12)), taking into account the current
partition;
until no more vertices can be added to the community.
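The procedure above can be sketched as follows in Python. This is an illustrative reading of the slides, not the authors' implementation, and several choices are assumptions: the graph is a dictionary `succ` of out-links, initial parameter values are supplied by the caller, the estimates are recomputed from scratch on the explored vertices (simple but not efficient), they are kept only when they satisfy 0 < β < α < 1, and a rejected vertex is neither expanded nor revisited.

```python
import math
from collections import deque


def d_min(s, n_c, alpha, beta, gamma):
    """Threshold of test (9); assumes 0 < beta < alpha < 1 and 0 < gamma < 1."""
    num = (math.log(s) + n_c * math.log(1 - beta) + math.log(1 - gamma)
           - math.log(1 - s) - n_c * math.log(1 - alpha) - math.log(gamma))
    return num / (math.log(alpha * (1 - beta)) - math.log((1 - alpha) * beta))


def estimate(succ, community, seen):
    """Estimates (10)-(12) restricted to the explored vertices."""
    n, n_c = len(seen), len(community)
    n_bar = n - n_c
    if n_bar == 0:
        return None  # no explored vertex outside the community yet
    inside = outside = 0
    for i in seen:
        for j in succ.get(i, ()):
            if j in seen and j != i:
                if i in community and j in community:
                    inside += 1
                else:
                    outside += 1
    return inside / n_c ** 2, outside / (n_bar * (n + n_c)), n_c / n


def extract_community(succ, seeds, alpha, beta, gamma, s=0.5):
    community, seen = set(seeds), set(seeds)
    queue, queued = deque(seeds), set(seeds)
    while queue:
        v = queue.popleft()
        seen.add(v)
        if v not in community:
            # Step 1: membership test (9) on links received from members.
            d_in = sum(1 for i in community if v in succ.get(i, ()))
            if d_in <= d_min(s, len(community), alpha, beta, gamma):
                continue  # rejected: its out-links are not followed
            community.add(v)
            # Step 2: parameter update, kept only if the values are usable.
            est = estimate(succ, community, seen)
            if est and 0 < est[1] < est[0] < 1 and 0 < est[2] < 1:
                alpha, beta, gamma = est
        for w in succ.get(v, ()):
            if w not in queued:
                queued.add(w)
                queue.append(w)
    return community


# Hypothetical toy graph: a, b, c, d are densely linked; e and f are noise.
succ = {"a": {"b", "c", "d", "e"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
        "d": {"a", "b", "c"}, "e": {"f"}, "f": set()}
print(sorted(extract_community(succ, ["a", "b", "c"], 0.8, 0.05, 0.3)))
```

In the toy graph, d receives links from all three seeds and is accepted, while e receives a single link and is rejected, so f is never explored. This illustrates why the cost depends on the community size rather than on the size of the whole graph.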


Preliminary experiments: blog community extraction


Settings
multi-threaded web crawler coupled with the proposed community
extraction procedure;
seed URLs taken from Wikio (http://www.wikio.com), which
provides several blog rankings for several topics;
these rankings were used to provide 100 or 50 seeds to the algorithm
for 4 test communities.



Blog community extraction

                    Comics (Fr)   Scrapbooking (Fr)   Food (U.S.)   Politics (U.S.)
Nb seeds            100           100                 50            50
Nc                  1,263         1,130               1,681         1,884
Nb edges            20,434        24,248              100,597       74,219
α                   0.01821       0.01899             0.03560       0.02091
β                   0.00093       0.00147             0.00091       0.00065
γ                   0.03048       0.05579             0.03060       0.01808
Biggest S.C.C.      1,251         1,129               1,667         1,877
Max level           3             2                   5             4
Diameter            6             7                   7             8
Radius              4             4                   4             3
Clustering coeff.   0.287         0.265               0.381         0.320
Transitivity        0.198         0.2                 0.290         0.223

Table: Global statistics and model parameters for the 4 communities.



Blog community extraction

      names                            level
 1    www.bouletcorp.com               0
 2    louromano.blogspot.com           2
 3    www.cartoonbrew.com              2
 4    yacinfields.blogspot.com         1
 5    polyminthe.blogspot.com          1
 6    marnette.canalblog.com           1
 7    blackwingdiaries.blogspot.com    2
 8    bastienvives.blogspot.com        1
 9    donshank.blogspot.com            2
10    john-nevarez.blogspot.com        2

Table: Best sites according to local PageRank for the Comics (Fr) community.



Figure: Word cloud for Politics (U.S.). The first 50 words in descending order of
their Kullback-Leibler divergence are kept (between the word document frequency in
the community and in a negative class of 10,000 random blogs; texts were
first preprocessed using a stop list and stemming). Word sizes are proportional to
the word document frequencies in the community.



Figure: Word cloud for Food (U.S.). The first 50 words in descending order of
their Kullback-Leibler divergence are kept (between the word document frequency in
the community and in a negative class of 10,000 random blogs; texts were
first preprocessed using a stop list and stemming). Word sizes are proportional to
the word document frequencies in the community.


Conclusion & future work


Conclusion
simple, greedy approach;
complexity scales with the community size, not the graph size;
blog community extraction was successfully performed with such a
tool.

Future work
More work is needed to better understand and evaluate the approach:
test the robustness of the method to noise in the seed set;
test on other application domains (with ground truth);
test using graph generation algorithms.



[AL06] R. Andersen and K. Lang.
Communities from seed sets.
In Proceedings of the 15th International Conference on World Wide Web, pages 223–232.
ACM Press, 2006.

[BB05] J.P. Bagrow and E.M. Bollt.
A local method for detecting communities.
Phys Rev E Stat Nonlin Soft Matter Phys, 72(4):046108, 2005.

[Cla05] A. Clauset.
Finding local community structure in networks.
Phys Rev E Stat Nonlin Soft Matter Phys, 72(2):026132, 2005.

[DPS08] J. Daudin, F. Picard, and S. Robin.
A mixture model for random graphs.
Statistics and Computing, 18:1–36, 2008.

[SG10] M. Sozio and A. Gionis.
The community-search problem and how to plan a successful cocktail party.
In Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining (KDD), pages –, 2010.

[ZAM08] H. Zanghi, C. Ambroise, and V. Miele.
Fast online graph clustering via Erdős-Rényi mixture.
Pattern Recognition, 41(12):3592–3599, December 2008.



Thanks for your attention!

