Hoang, T. A., & Lim, E. P. (2014, November). On joint modeling of topical communities and personal interest in microblogs. In International Conference on Social Informatics (pp. 1-16). Springer, Cham.
On Joint Modeling of Topical Communities and Personal Interest in Microblogs
1. ON JOINT MODELING OF TOPICAL COMMUNITIES AND PERSONAL INTEREST IN MICROBLOGS
Tuan-Anh Hoang and Ee-Peng Lim
Living Analytics Research Centre
Singapore Management University
Social Informatics, 6th International Conference, SocInfo2014
3. ABSTRACT
• Propose the topical communities and personal
interest (TCPI) model for simultaneously modeling
topics, topical communities, and users’ topical
interests in microblogging data.
• Experiments on two Twitter datasets show that TCPI
can effectively mine the representative topics for
each topical community.
• Also demonstrate that TCPI significantly outperforms other state-of-the-art topic models in the tweet generation modeling task.
4. INTRODUCTION (1/7)
• Microblogging sites such as Twitter allow users to publish short messages, which are called tweets.
• Empirical and user studies on microblog usage have shown that users may tweet about either their personal topics or background topics.
5. INTRODUCTION (2/7)
• Personal topics: cover the individual interests of the users.
• Background topics: the interests shared by users in topical communities; they emerge when users in the same community tweet about common interests, and are thus the result of the topical communities' interests.
6. INTRODUCTION (3/7)
• There are previous works on modeling background topics in social media as well as in general document corpora.
• However, most of these works model a single background topic or a distribution of background topics.
• We instead consider the existence of multiple topical
communities, each with a different background topic
distribution.
7. INTRODUCTION (4/7)
• A user who is associated with a topical community
will therefore adopt topics from the interest of the
community.
• Hence, when modeling users on social media, we have to consider both the user's personal interests and her topical communities.
8. INTRODUCTION (5/7)
• One way to identify the topical communities is to first perform topic modeling using an existing model (e.g., LDA) to find the users' topical interests, and then assign the topics most common across all users as the topical communities' topics.
9. INTRODUCTION (6/7)
• Such an approach, however, does not allow us to:
• distinguish between multiple topical communities,
• allow each topical community to have multiple topics,
• quantify the degree to which a user depends on topical communities.
10. INTRODUCTION (7/7)
• We therefore propose to jointly model users' topical interests and topical communities' interests in the same framework, where each user has a parameter controlling her bias toward generating content based on her own interests or based on the topical communities.
11. INTRODUCTION – CONTRIBUTION (1/2)
1. Propose a probabilistic graphical model, called
Topical Communities and Personal Interest model
(abbreviated as TCPI), for modeling topics and
topical communities, as well as modeling users’
topical interests and their dependency on the topical
communities in generating content.
2. Develop a sampling method to infer the model’s
parameters, and a regularization technique to bias
the model to learn more semantically clear topical
communities.
12. INTRODUCTION – CONTRIBUTION (2/2)
3. We apply the TCPI model on two Twitter datasets and show that it significantly outperforms other state-of-the-art models in the tweet generation modeling task.
4. An empirical analysis of topics and topical
communities for the two datasets has been
conducted to demonstrate the efficacy of the TCPI
model.
13. RELATED WORKS
• Works on analyzing topics in microblogs
• Works on analyzing communities in social networks
14. RELATED WORKS – TOPIC ANALYSIS (1/5)
• Michelson et al. first examined topical interests of Twitter users by analyzing the named entities mentioned in their tweets.
• Hong et al. then conducted an empirical study on different ways of performing topic modeling on tweets using the original LDA model and the Author-Topic model.
• Mehrotra et al. found that grouping tweets containing the same hashtags may lead to a significant improvement of LDA model performance.
15. RELATED WORKS – TOPIC ANALYSIS (2/5)
• Qiu et al. proposed to jointly model the topics of tweets and their associated posting behaviors.
• Zhao et al. proposed the TwitterLDA topic model, which is considered a state-of-the-art topic model for microblogging data.
16. RELATED WORKS – TOPIC ANALYSIS (3/5)
• TwitterLDA is a variant of LDA in which:
a) documents are formed by aggregating tweets posted by the same user
b) a single background topic is assumed
c) there is only one common topic for all words in
each tweet
d) each word in a tweet is generated from either the
background topic or the user’s topic.
18. RELATED WORKS – TOPIC ANALYSIS (5/5)
• The TwitterLDA model, however, does not consider multiple background topics, and impractically assumes that all users have the same dependency on the unique background topic (as the parameter μ is common for all users).
19. RELATED WORKS – COMMUNITY ANALYSIS (1/3)
• Most of the early works on community analysis in social networks focus on finding social communities based on social links among the users.
• Newman discovers social communities by finding a network partition that maximizes a measure of "compactness" of community structure called modularity.
20. RELATED WORKS – COMMUNITY ANALYSIS (2/3)
• Airoldi et al. proposed a statistical mixed membership model; other works find topical communities based on user-generated content, users' attributes, and interest affiliations.
• Ding et al. conducted an empirical study showing that the social community structure of a social network may be significantly different from the topical communities discovered from the same network.
21. RELATED WORKS – COMMUNITY ANALYSIS (3/3)
• Most existing works do not differentiate users' personal interests from those of topical communities, assuming that a user's topical interests are determined purely by her topical communities' interests.
• This assumption is not practical since microblog users express interest in a vast variety of topics of daily life, and their interests are therefore not always determined by their topical communities.
22. TCPI - ASSUMPTION (1/2)
• Users generate content topically
• For each user, there is always an underlying topic explaining the content of every tweet she posts.
23. TCPI - ASSUMPTION (2/2)
• Users generate content according to either their personal interests or some topical communities.
• While different users generally have different personal topical interests, their generated content also shares some common topics of the topical communities the users belong to.
24. TCPI - GENERATIVE PROCESS (1/5)
• Users generate tweets from a vocabulary V.
• There are K latent topics; each topic k has a multinomial distribution φ_k over the vocabulary V.
• There are C topical communities; each community c has a multinomial distribution σ_c over the K topics.
• Each user u has a personal topic distribution θ_u over the K topics and a community distribution π_u over the C topical communities.
25. TCPI - GENERATIVE PROCESS (2/5)
• Each user has a dependence distribution μ_u, which is a Bernoulli distribution indicating how likely the user tweets based on her own personal interests (μ_u^0) or based on the topical communities (μ_u^1 = 1 − μ_u^0).
• Lastly, we assume that θ_u, π_u, σ, and φ have Dirichlet priors α, τ, η, and β respectively, while μ_u has a Beta prior ρ.
26. TCPI - GENERATIVE PROCESS (3/5)
• To generate a tweet t for user u, we first flip a biased coin y_t^u (whose bias to head up is μ_u^0) to decide if the tweet is based on u's personal interests, or based on one of the topical communities u belongs to.
• If y_t^u = 0, we then choose the topic z_t for the tweet according to u's topic distribution θ_u.
27. TCPI - GENERATIVE PROCESS (4/5)
• If y_t^u = 1, we first choose a topical community c according to u's community distribution π_u, then we choose z_t according to the chosen community's topic distribution σ_c.
• Assuming that each tweet has only one topic, once the topic z_t is chosen, the words in t are then chosen according to the topic's word distribution φ_{z_t}.
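To make the generative story concrete, the following is a minimal sketch of the TCPI generative process in Python/NumPy. All names (sample_tweet, n_words, etc.) and the fixed tweet length are illustrative assumptions; the slides do not specify an implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the experiments later use K = 80 topics and C = 3 communities.
K, C, W = 80, 3, 5000          # topics, topical communities, vocabulary size
alpha, tau, eta, beta = 0.1, 0.1, 0.1, 0.01
rho = (1.0, 1.0)               # Beta prior on the dependence distribution

phi   = rng.dirichlet([beta] * W, size=K)   # topic-word distributions phi_k
sigma = rng.dirichlet([eta] * K, size=C)    # community-topic distributions sigma_c

def sample_user():
    """Draw the per-user parameters theta_u, pi_u, mu_u^0."""
    theta = rng.dirichlet([alpha] * K)      # personal topic distribution
    pi    = rng.dirichlet([tau] * C)        # community distribution
    mu0   = rng.beta(*rho)                  # prob. of tweeting from personal interest
    return theta, pi, mu0

def sample_tweet(theta, pi, mu0, n_words=10):
    """Generate one tweet: flip the coin y, pick one topic z, then draw words."""
    y = int(rng.random() >= mu0)            # y = 0: personal, y = 1: community
    if y == 0:
        z = rng.choice(K, p=theta)          # topic from the user's own interest
        c = None
    else:
        c = rng.choice(C, p=pi)             # pick a topical community first
        z = rng.choice(K, p=sigma[c])       # then a topic from that community
    words = rng.choice(W, size=n_words, p=phi[z])   # one topic per tweet
    return y, c, z, words

theta_u, pi_u, mu0_u = sample_user()
print(sample_tweet(theta_u, pi_u, mu0_u))
```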
29. TCPI - MODEL LEARNING
• Given a set of microblogging users together with their posted tweets, we now present the algorithm for performing inference in the TCPI model.
• We use U to denote the number of users and use W
to denote the number of words in the tweet
vocabulary V. We denote the set of all posted tweets
in the dataset by T.
30. TCPI - MODEL LEARNING
• For each user u_i, we denote her j-th tweet by t_j^i. For each posted tweet t_j^i, we denote the n_j^i words in the tweet by w_1^{ij}, ..., w_{n_j^i}^{ij} respectively, and we denote the tweet's topic, coin, and topical community (if it exists) by z_j^i, y_j^i, and c_j^i respectively. Lastly, we denote the bag-of-topics, bag-of-coins, and bag-of-topical-communities of all the posted tweets in the dataset by Z, Y, and C respectively.
31. TCPI - MODEL LEARNING
• Due to the intractability of LDA-based models, we make use of a sampling method for learning and estimating the parameters of the TCPI model.
• We use a collapsed Gibbs sampler to iteratively and jointly sample the latent coin and latent topical community, and to sample the latent topic, of every posted tweet as follows.
32.
• For each posted tweet t_j^i (the j-th tweet posted by user u_i), we denote by Y_{−t_j^i}, C_{−t_j^i}, and Z_{−t_j^i} the bag-of-coins, bag-of-topical-communities, and bag-of-topics of all other posted tweets in the dataset except the tweet t_j^i.
• The coin y_j^i and the topical community c_j^i of t_j^i are jointly sampled according to the equations in Figure 2.
33.
• ny(y, u, Y) records the number of times the coin y is observed in the set of tweets of user u for the bag-of-coins Y.
• nzu(z, u, Z) records the number of times the topic z is observed in the set of tweets of user u for the bag-of-topics Z.
• nzc(z, c, Z, C) records the number of times the topic z is observed in the set of tweets that are tweeted based on the topical community c by any user, for the bag-of-topics Z and the bag-of-topical-communities C.
• ncu(c, u, C) records the number of times the topical community c is observed in the set of tweets of user u for the bag-of-topical-communities C.
34.
• The topic z_j^i of t_j^i is sampled according to the equations in Figure 3.
• Note that when y_j^i = 0, we do not have to sample c_j^i, and the current c_j^i (if it exists) will be discarded.
35.
• nzu(z, u, Z) records the number of times the topic z is observed in the set of tweets of user u for the bag-of-topics Z.
• nzc(z, c, Z, C) records the number of times the topic z is observed in the set of tweets that are tweeted based on the topical community c by any user, for the bag-of-topics Z and the bag-of-topical-communities C.
• nw(w, z, T, Z) records the number of times the word w is observed in the topic z for the set of tweets T and the bag-of-topics Z.
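The collapsed Gibbs sampler only needs these count statistics, which can be kept as integer arrays and updated incrementally when a tweet's latent variables are resampled. Below is a minimal bookkeeping sketch under that assumption; the array names mirror the counters defined on slides 33 and 35, and the remove/add helpers are hypothetical (the sampling equations of Figures 2 and 3 are not reproduced here).

```python
import numpy as np

U, K, C_comm, W = 1000, 80, 3, 5000   # illustrative sizes

n_y  = np.zeros((U, 2), dtype=int)          # ny(y, u, Y): coin counts per user
n_zu = np.zeros((U, K), dtype=int)          # nzu(z, u, Z): personal topic counts
n_zc = np.zeros((C_comm, K), dtype=int)     # nzc(z, c, Z, C): community topic counts
n_cu = np.zeros((U, C_comm), dtype=int)     # ncu(c, u, C): community counts per user
n_w  = np.zeros((K, W), dtype=int)          # nw(w, z, T, Z): word counts per topic

def remove_tweet(u, y, c, z, words):
    """Take a tweet's current assignments out of the counts (the '-t_j^i' statistics)."""
    n_y[u, y] -= 1
    if y == 0:
        n_zu[u, z] -= 1
    else:
        n_cu[u, c] -= 1
        n_zc[c, z] -= 1
    for w in words:
        n_w[z, w] -= 1

def add_tweet(u, y, c, z, words):
    """Put the newly sampled assignments back into the counts."""
    n_y[u, y] += 1
    if y == 0:
        n_zu[u, z] += 1
    else:
        n_cu[u, c] += 1
        n_zc[c, z] += 1
    for w in words:
        n_w[z, w] += 1
```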
36. TCPI - MODEL LEARNING
1. the first term is proportional to the probability that the coin 0 is generated given the priors and the (current) values of all other latent variables (i.e., the coins, topical communities (if they exist), and topics of all other tweets)
2. the second term is proportional to the probability that the (current) topic z_j^i is generated given the priors, the (current) values of all other latent variables, and the chosen coin.
37. TCPI - MODEL LEARNING
1. the first term is proportional to the probability that the coin 1 is generated
given the priors and (current) values of all other latent variables
2. the second term is proportional to the probability that the topical
community c is generated given the priors, (current) values of all other
latent variables, and the chosen coin
3. the third term is proportional to the probability that the (current) topic z_j^i is generated given the priors, (current) values of all other latent variables, and the chosen coin as well as the chosen community.
38. TCPI - MODEL LEARNING
The terms on the right-hand side of Equations 3 and 4 have similar meanings to those of Equations 1 and 2, respectively.
39. TCPI - SPARSITY REGULARIZATION (1/12)
• We want each topical community's topic distribution and each user's topic distribution to be skewed toward different topics, and different topical communities' topic distributions to also be skewed toward different topics.
40. TCPI - SPARSITY REGULARIZATION (2/12)
In estimating the parameters of the TCPI model, we need to obtain sparsity in the following distributions:
• Topic specific coin distribution p(y|z) :
To ensure that each topic z is mostly covered by either
users’ personal interests or topical communities.
• Topic specific topical community distribution p(c|z) :
To ensure that each topic z is mostly covered by one or
only a few topical communities.
To obtain the sparsity mentioned above, we use the pseudo-observed variable based regularization technique proposed by Balasubramanyan et al., as follows.
41. TCPI - SPARSITY REGULARIZATION (3/12)
Topic Specific Coin Distribution Regularization
Since the topic specific coin distributions are determined by both the coin and topical community joint sampling step and the topic sampling step, we regularize both steps to bias the distributions toward the expected sparsity.
42. TCPI - SPARSITY REGULARIZATION (4/12)
In Coin and Topical Community Joint Sampling Steps.
In each coin and topical community sampling step for the tweet t_j^i, we multiply the right hand side of the equations in Figure 2 with a corresponding regularization term RtopCoin-C&C(y | z_j^i), which is computed based on the empirical entropy of p(y | z_j^i) as in Equation 5.
(Equation 5, not reproduced here, combines: the empirical entropy of p(y′ | z_j^i) when y_j^i = y; the pre-defined expected mean of the entropy of p(y|z); and the pre-defined expected variance of the entropy of p(y|z).)
43. TCPI - SPARSITY REGULARIZATION (5/12)
In Topic Sampling Steps.
In each topic sampling step for the tweet t_j^i, we multiply the right hand side of the equations in Figure 3 with a corresponding regularization term RtopCoin-Topic(z | t_j^i), which is computed based on the empirical entropy of p(y|z) as in Equation 6.
(Equation 6, not reproduced here, combines: the empirical entropy of p(y | z′) when z_j^i = z; the expected mean of the entropy of p(y|z); and the expected variance of the entropy of p(y|z).)
44. TCPI - SPARSITY REGULARIZATION (6/12)
With a low expected mean μtopCoin, these regularization terms:
(1) increase the weight of values of y, c, and z that give lower empirical entropy of p(y|z), hence increasing the sparsity of these distributions;
(2) decrease the weight of values of y, c, and z that give higher empirical entropy of p(y|z), hence penalizing less sparse distributions.
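As a concrete illustration, the sketch below computes an entropy-based regularization factor of the kind described above. It assumes, following the pseudo-observed-variable idea of Balasubramanyan et al., that the factor is a Gaussian-shaped function of the empirical entropy centered at the expected mean; the exact functional form of Equations 5–8 is given in the paper, so treat this only as an approximation of the mechanism.

```python
import numpy as np

def empirical_entropy(counts):
    """Entropy of the empirical distribution implied by a vector of counts."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def entropy_regularizer(counts, expected_mean, expected_var):
    """Gaussian-shaped pseudo-observation on the entropy (assumed form).

    Candidate assignments whose resulting empirical entropy is close to
    expected_mean get a factor near 1; with a low expected_mean, low-entropy
    (i.e., sparse) configurations are favored."""
    h = empirical_entropy(counts)
    return float(np.exp(-((h - expected_mean) ** 2) / (2.0 * expected_var)))

# Example: counts of coins y in {0, 1} observed for tweets assigned to topic z,
# assuming the candidate assignment y_j^i = 0 has already been counted in.
coin_counts_for_z = np.array([40, 3])
r = entropy_regularizer(coin_counts_for_z, expected_mean=0.1, expected_var=0.5)
# The unnormalized sampling probability from Figure 2 would be multiplied by r.
print(r)
```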
45. TCPI - SPARSITY REGULARIZATION (7/12)
Topic Specific Topical Community Distribution Regularization
Since the topic specific topical community distributions are determined by both the coin and topical community joint sampling step and the topic sampling step, we regularize both steps to bias the distributions toward the expected sparsity.
46. TCPI - SPARSITY REGULARIZATION (8/12)
In Coin and Topical Community Joint Sampling Steps.
In each coin and topical community sampling step for the tweet t_j^i, we also multiply the right hand side of the equations in Figure 2 with a corresponding regularization term RtopComm-C&C(y, c | z_j^i), which is computed based on the empirical entropy of p(c′ | z_j^i) as in Equation 7.
(Equation 7, not reproduced here, combines: the empirical entropy of p(c′ | z_j^i) when y_j^i = y and c_j^i = c; the pre-defined expected mean of the entropy of p(c|z); and the pre-defined expected variance of the entropy of p(c|z).)
47. TCPI - SPARSITY REGULARIZATION (9/12)
In Topic Sampling Steps.
In each topic sampling step for the tweet t_j^i, we also multiply the right hand side of the equations in Figure 3 with a corresponding regularization term RtopComm-Topic(z | t_j^i), which is computed based on the empirical entropy of p(c|z) as in Equation 8.
(Equation 8, not reproduced here, combines: the empirical entropy of p(c | z′) when z_j^i = z; the expected mean of the entropy of p(c|z); and the expected variance of the entropy of p(c|z).)
48. TCPI - SPARSITY REGULARIZATION (10/12)
With a low expected mean μtopComm, these regularization terms:
(1) increase the weight of values of y, c, and z that give lower empirical entropy of p(c|z), hence increasing the sparsity of these distributions;
(2) decrease the weight of values of y, c, and z that give higher empirical entropy of p(c|z), hence penalizing less sparse distributions.
50. TCPI - SPARSITY REGULARIZATION (12/12)
We train the model with 600 iterations of Gibbs sampling, and take 25 samples with a gap of 20 iterations within the last 500 iterations to estimate all the hidden variables.
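The training schedule above can be sketched as a simple loop: 600 Gibbs iterations, with 25 samples collected 20 iterations apart over the last 500 iterations and then used to estimate the hidden variables. The gibbs_iteration and estimate_from functions below are hypothetical placeholders for the sampler and estimator.

```python
NUM_ITERATIONS = 600
BURN_IN = 100            # samples are only taken in the last 500 iterations
SAMPLE_GAP = 20          # 25 samples spaced 20 iterations apart

def run_training(gibbs_iteration, estimate_from):
    """Run the Gibbs sampler and average estimates over the collected samples.

    gibbs_iteration(): performs one full sweep over all tweets (hypothetical).
    estimate_from(samples): turns collected samples into parameter estimates
    (hypothetical)."""
    samples = []
    for it in range(1, NUM_ITERATIONS + 1):
        gibbs_iteration()
        if it > BURN_IN and (it - BURN_IN) % SAMPLE_GAP == 0:
            samples.append(it)   # in practice, snapshot the count statistics here
    assert len(samples) == 25
    return estimate_from(samples)
```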
51. EXPERIMENTAL EVALUATION – DATASET (1/2)
Using snowball sampling:
SE Dataset. This dataset is collected from a set of
Twitter users who are interested in technology, and
particularly in software development from August 1st
to October 31st, 2011.
Two-Week Dataset. A large corpus of tweets collected just before the 2012 US presidential election, from August 25th to September 7th, 2012.
52. EXPERIMENTAL EVALUATION – DATASET (2/2)
1. Removed stopwords from the tweets.
2. Filtered out tweets with fewer than 3 non-stopwords.
3. Excluded users with fewer than 50 (remaining) tweets.
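A minimal sketch of this preprocessing pipeline is below. The stopword list and tokenizer are illustrative assumptions; the slides only state the three filtering rules.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "rt"}  # illustrative list

def preprocess(user_tweets, min_words=3, min_tweets=50):
    """Apply the three filtering rules: drop stopwords, short tweets, and sparse users.

    user_tweets: dict mapping user id -> list of raw tweet strings."""
    kept = defaultdict(list)
    for user, tweets in user_tweets.items():
        for tweet in tweets:
            # 1. remove stopwords (simple whitespace tokenization assumed)
            words = [w for w in tweet.lower().split() if w not in STOPWORDS]
            # 2. keep only tweets with at least `min_words` non-stopwords
            if len(words) >= min_words:
                kept[user].append(words)
    # 3. exclude users with fewer than `min_tweets` remaining tweets
    return {u: ts for u, ts in kept.items() if len(ts) >= min_tweets}
```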
53. EXPERIMENTAL EVALUATION - EVALUATION METRICS
• Compare TCPI with the TwitterLDA model.
• Adopt likelihood and perplexity for evaluating the two models.
• For each user, we randomly select 90% of the user's tweets to form a training set, and use the remaining 10% of the tweets as the test set.
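For reference, a held-out perplexity of the kind used here is typically computed as the exponential of the negative average per-word log-likelihood over the test tweets. The sketch below assumes per-word probabilities have already been obtained from the learned model (how they are computed for TCPI follows the model's mixture structure and is not spelled out on this slide).

```python
import math

def perplexity(word_log_likelihoods):
    """Perplexity from per-word log-likelihoods of all test-set words (lower is better)."""
    n = len(word_log_likelihoods)
    avg_ll = sum(word_log_likelihoods) / n
    return math.exp(-avg_ll)

# Example with made-up per-word log-likelihoods for a few test words.
print(perplexity([-6.2, -5.9, -7.1, -6.5]))
```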
54. EXPERIMENTAL EVALUATION - PERFORMANCE COMPARISON
(1) TCPI significantly outperforms TwitterLDA in the topic modeling task.
(2) TCPI is robust against the number of topical communities, as its performance does not significantly change as we increase the number of communities from 1 to 5.
55. EXPERIMENTAL EVALUATION - BACKGROUND TOPICS AND TOPICAL COMMUNITIES ANALYSIS (1/6)
Considering
1. both time and space complexities, and
2. that it is not practical to expect a large number of topics falling into topical communities,
we set the number of topical communities in the TCPI model to 3, and the number of topics in both models to 80.
57. EXPERIMENTAL EVALUATION - BACKGROUND TOPICS AND TOPICAL COMMUNITIES ANALYSIS (3/6)
• A topic's top words are the words having the highest likelihoods given the topic.
• A topic's top tweets are the tweets having the lowest perplexities given the topic.
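Concretely, the top words can be read directly off each topic's word distribution φ_z, and the top tweets ranked by per-tweet perplexity under that topic. A small sketch, assuming the learned matrix phi (topics × vocabulary) and an index-to-word list are available:

```python
import numpy as np

def top_words(phi, z, id2word, n=10):
    """Return the n words with the highest probability under topic z."""
    top_ids = np.argsort(phi[z])[::-1][:n]
    return [id2word[i] for i in top_ids]

def tweet_perplexity(phi, z, word_ids):
    """Perplexity of a single tweet under topic z; lower means a better top tweet."""
    log_probs = np.log(phi[z, word_ids])
    return float(np.exp(-log_probs.mean()))
```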
58. EXPERIMENTAL EVALUATION - BACKGROUND TOPICS AND TOPICAL COMMUNITIES ANALYSIS (4/6)
• The background topic found by the TwitterLDA model is not semantically clear.
• The topical communities and their extreme topics found by the TCPI model are both semantically clear and reasonable.
• This agrees with the findings of Zhao et al. that people also use Twitter for gathering and sharing useful information relevant to their profession.
61. CONCLUSION
• Propose a novel topic model called TCPI for simultaneously modeling topical communities and users' topical interests in microblogging data. Our model differentiates users' personal interests from those of their topical communities while learning both sets of latent factors at the same time.
• Also report experiments on two Twitter datasets
showing the effectiveness of the proposed model.
TCPI is shown to outperform TwitterLDA, another state-of-the-art topic model for modeling tweet generation.
62. CONCLUSION – FUTURE WORKS
• To consider the scalability of the proposed model.
Possible solutions for scaling up the model are approximate and distributed implementations of Gibbs sampling procedures, and stale synchronous parallel implementations of variational inference procedures.
• It is potentially helpful to incorporate prior knowledge into the proposed model. Examples of such prior knowledge are topic-indicative features and ground-truth community labels for some users.