1. The document proposes methods for discovering communities in social networks using content and interactions by modeling communities based on discussed topics and social connections between users. This allows discovering both user interests and popular topics within each community.
2. Bayesian models are used to extract latent communities from the network, assuming community relationships depend on user interests in topics and their links. Different models are proposed to handle different network structures like broadcast vs conversation networks.
3. The models aim to utilize both content and link information to discover communities in incomplete social networks with missing link information. A distance metric is learned using observed links and used for hierarchical clustering.
2. Abstract
Problem: discovering meaningful communities from a social network
We propose generative models that can
discover communities based on the discussed
topics, interaction types and the social
connections among people.
Person -> multiple communities -> multiple topics
We discover both community interests and user
interests based on the information and linked
associations.
3. Introduction
Background:
rich data -> academia & business;
discover relationships -> discover community
A community is a collection of users as a group such that there is high relatedness among people within the group.
One common approach is to treat communities as groups of nodes in a social network that are more densely connected among themselves than with the rest of the network.
A graph clustering problem
4. We consider communities as “groups of users (nodes) who are interconnected and communicate on shared topics”.
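1. Bayesian models are used to extract latent communities. Model assumption: community membership depends on the topics users are interested in and on the links between them. This approach helps discover a user's interests and their role in the network, as well as the topics that are popular within a community. Given a topic or interest, one can therefore find the related communities.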
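5.
2. We also utilize the “type” of interactions between users to emphasize their interest in topics, and thus community membership.
3. For example: conversation vs broadcast.
Two kinds of social networks: (1) a user's posts are broadcast to their neighbors; (2) a user can only send posts directly to other people (e.g. email networks). The paper therefore proposes two different methods, one for each network structure.
Assumption: a post discusses only a single topic, in order to reduce model training time.
This assumption no longer fits when posts are long, so the paper also gives another model that handles that case.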
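The "person -> communities -> topics" idea, combined with the single-topic-per-post assumption, can be sketched as a toy generative process. This is only an illustration, not the paper's actual model: the function names, the two-community and two-topic setup, and all probability values below are hypothetical.

```python
import random

def sample_index(weights, rng):
    """Draw an index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def generate_post(user_communities, community_topics, topic_words, n_words, rng):
    """Generate one post under the single-topic-per-post assumption."""
    c = sample_index(user_communities, rng)      # community the user posts in
    z = sample_index(community_topics[c], rng)   # one topic for the whole post
    vocab, probs = zip(*topic_words[z].items())
    words = rng.choices(vocab, weights=probs, k=n_words)  # every word comes from topic z
    return c, z, words

# Hypothetical toy parameters, for illustration only.
user_communities = [0.7, 0.3]                # one user's distribution over communities
community_topics = [[0.9, 0.1], [0.2, 0.8]]  # each community's distribution over topics
topic_words = [
    {"game": 0.5, "team": 0.5},
    {"vote": 0.6, "law": 0.4},
]

rng = random.Random(0)
community, topic, words = generate_post(
    user_communities, community_topics, topic_words, n_words=5, rng=rng
)
```

A model for long posts would instead draw a topic per word (or per sentence); that variant is not shown here.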
6. PRIOR WORK
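First kind: considers only the links between users. Ignores other node properties and user interactions. Does not allow a user to belong to multiple communities.
Second kind: Bayesian probabilistic models. These can handle a user belonging to multiple communities, but still rely too heavily on link structure to discover communities.
Third kind: uses semantic content to discover communities. Communities are modeled as random mixtures over users who in turn have a topical distribution (interest) associated with them. Link information is not used.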
21. EXPERIMENTS
Datasets:
Twitter over a period of six months in 2009
Enron Email corpus
We set the number of communities C at 10 and topics Z at 20.
We ran 1000 iterations to burn in and took 250 samples (every fourth sample) in the next 1000 iterations.
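The sampling schedule (1000 burn-in iterations, then every fourth sample over the next 1000) can be sketched as below. The `step` function is a hypothetical placeholder for one sweep of the sampler, and which iteration inside each group of four is kept is an assumption; the slide only gives the counts.

```python
def run_chain(step, state, burn_in=1000, sampling=1000, thin=4):
    """Run burn_in + sampling iterations and keep every thin-th post-burn-in state."""
    samples = []
    for it in range(1, burn_in + sampling + 1):
        state = step(state)  # placeholder for one sampler sweep
        if it > burn_in and (it - burn_in) % thin == 0:
            samples.append(state)
    return samples

# Placeholder "sampler": the state is just the iteration counter,
# so the retained states show which iterations were kept.
samples = run_chain(lambda s: s + 1, 0)
```

With these defaults the chain keeps 1000 / 4 = 250 states, matching the count on the slide.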
31. CONCLUSION
We proposed probabilistic schemes that incorporate topics, social relationships and the nature of posts for more effective community discovery.
Interaction types are important
33. Abstract
Detecting communities in incomplete information networks with missing edges.
1. Learn a distance metric to reproduce the link-based distance between nodes from the observed edges in the local information regions.
2. Use the learned distance metric to estimate the distance between any pair of nodes in the network.
A hierarchical clustering approach
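Once a distance between any pair of nodes is available, a hierarchical clustering can be built on top of it. The sketch below uses plain single-linkage agglomerative clustering as a generic stand-in; it is not the DSHRINK algorithm, and the Euclidean function only takes the place of the learned metric, whose learning procedure the slides do not show. The node feature vectors are hypothetical.

```python
def single_linkage(points, distance, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members.
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters

def euclidean(x, y):
    """Stand-in for a learned distance metric."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

# Hypothetical node feature vectors: two tight pairs far from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = single_linkage(points, euclidean, n_clusters=2)
```

Any other distance function with the same two-argument signature can be passed in place of `euclidean`.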
34. INTRODUCTION
The community is defined as a group of nodes which are densely connected inside the group, while loosely connected with the nodes outside the group.
The local regions with complete linkage information are called local information regions.
Examples: terrorist-attack network, food web network.
35.
36. Contributions
We identify and define the problem of community
detection in incomplete information networks with
local information regions
Then a metric, which can be used to measure the
distance between any pair of nodes, is learned.
Based on the learned metric, we devise a
distance-based modularity function to evaluate
the quality of the communities.
We propose a distance-based algorithm, DSHRINK, which can discover hierarchical and overlapping communities.
37. RELATED WORK
1. Methods focused on the topological structures.
2. Some graph clustering methods which are based on attributes.
3. Some clustering methods based on both links and attributes were also proposed.
47. EXPERIMENTS
Data Sets
DBLP-A Dataset: DBLP-A is the data set extracted from the DBLP database, which provides bibliographic information on computer science journals and proceedings.
DBLP-B Dataset:
48. Incomplete Information Network Generation
Snowball sampling
parameter p, called sample ratio
parameter q, called local information region size
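Snowball sampling grows a sample outward from a seed node by repeatedly adding neighbors. The sketch below is a generic breadth-first version with a size budget. The slide names the parameters p and q but does not say how they drive the procedure, so the `max_nodes` budget, the helper names and the toy graph are all assumptions.

```python
from collections import deque

def snowball_sample(adjacency, seed, max_nodes):
    """Breadth-first expansion from a seed until max_nodes nodes are collected."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
                if len(visited) == max_nodes:
                    break
    return visited

def induced_edges(adjacency, nodes):
    """Edges with both endpoints inside the sampled region."""
    keep = set(nodes)
    return sorted({tuple(sorted((u, v)))
                   for u in keep for v in adjacency[u] if v in keep})

# Hypothetical toy graph as an adjacency list.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
region = snowball_sample(adjacency, seed=0, max_nodes=3)
edges = induced_edges(adjacency, region)
```

Edges inside a sampled region are kept and the rest are treated as unobserved, which is one way to simulate an incomplete information network.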
49. Evaluation Measures
The definition of purity is as follows: each cluster is first assigned the most frequent class in the cluster, and purity is then measured by counting the instances whose class matches their cluster's assigned label, over all clusters.
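A minimal sketch of the purity computation follows. Dividing the matched count by the total number of instances is the usual convention and is an assumption here, since the slide only mentions counting; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def purity(cluster_ids, class_labels):
    """Fraction of instances whose class equals the majority class of their cluster."""
    members = {}
    for cid, label in zip(cluster_ids, class_labels):
        members.setdefault(cid, []).append(label)
    # For each cluster, count the instances carrying its most frequent class.
    matched = sum(Counter(labels).most_common(1)[0][1]
                  for labels in members.values())
    return matched / len(class_labels)

# Toy example: two clusters of three instances each.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "a"]
score = purity(cluster_ids, class_labels)
```

In this example each cluster has two instances in its majority class, so 4 of the 6 instances match.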
50. Compared Methods
Kmeans:
Md + DSHRINK: We learn a diagonal Mahalanobis matrix Md and use it as the input of M for DSHRINK.
Mf + DSHRINK: We learn a full Mahalanobis matrix Mf and use it as the input of M for DSHRINK.
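The difference between a diagonal and a full matrix M can be seen in the standard Mahalanobis form d_M(x, y) = sqrt((x - y)^T M (x - y)). The sketch below only evaluates that form. The matrices are hypothetical hand-picked values, not learned ones, and the slides do not show how Md and Mf are learned.

```python
def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    n = len(diff)
    M_diff = [sum(M[i][j] * diff[j] for j in range(n)) for i in range(n)]
    return sum(d * m for d, m in zip(diff, M_diff)) ** 0.5

x, y = [1.0, 2.0], [0.0, 0.0]

# Diagonal matrix: each feature is only rescaled.
M_diag = [[4.0, 0.0], [0.0, 1.0]]
# Full matrix: off-diagonal entries also model interactions between features.
M_full = [[4.0, 1.0], [1.0, 1.0]]

d_diag = mahalanobis(x, y, M_diag)  # sqrt(4*1 + 1*4) = sqrt(8)
d_full = mahalanobis(x, y, M_full)  # sqrt(8 + 2*1*1*2) = sqrt(12)
```

A diagonal M has one parameter per feature, while a full M has one per pair of features, so it is more expressive but has more parameters to learn.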