ON JOINT MODELING OF
TOPICAL COMMUNITIES
AND PERSONAL INTEREST
IN MICROBLOGS
Tuan-Anh Hoang and Ee-Peng Lim
Living Analytics Research Centre
Singapore Management University
Social Informatics, 6th International Conference, SocInfo2014
1
OUTLINE
2
• Abstract
• Introduction
• Topical Communities and Personal Interest (TCPI) model
• Experimental Evaluation
• Conclusion
ABSTRACT
• Propose the topical communities and personal
interest (TCPI) model for simultaneously modeling
topics, topical communities, and users’ topical
interests in microblogging data.
• Experiments on two Twitter datasets show that TCPI
can effectively mine the representative topics for
each topical community.
• Also demonstrate that TCPI significantly outperforms
other state-of-the-art topic models in modeling the
tweet generation task. 3
INTRODUCTION (1/7)
• Microblogging sites such as Twitter allow users to
publish short messages called tweets.
• Empirical and user studies on microblog usage have
shown that users may tweet about either their
personal topics or background topics.
4
INTRODUCTION (2/7)
• Personal topics: cover the individual interests of the
users.
• Background topics: interests shared by users in
topical communities; they emerge when users in
the same community tweet about common interests,
and are thus the result of the interests of the topical
communities.
5
INTRODUCTION (3/7)
• There are previous works on modeling background topics in
social media as well as in general document corpora.
• However, most of these works model a single
background topic or a single distribution of background
topics.
• We instead consider the existence of multiple topical
communities, each with a different background topic
distribution.
6
INTRODUCTION (4/7)
• A user who is associated with a topical community
will therefore adopt topics from the interest of the
community.
• Hence, when modeling users on social media, we
have to consider both the user’s personal interests
and her topical communities.
7
INTRODUCTION (5/7)
• One way to identify topical communities is to first
perform topic modeling using an existing model
(e.g., LDA) to find the topical interests of the
users, and then take the most common topics across all
users as the topical communities’ topics.
8
INTRODUCTION (6/7)
• Such an approach however does not allow us to
• distinguish between multiple topical communities,
• allow each topical community to have multiple
topics, or
• quantify the degree to which a user depends on
topical communities.
9
INTRODUCTION (7/7)
• We therefore propose to jointly model users’ topical
interests and topical communities’ interests in a
single framework where each user has a parameter
controlling her bias toward generating content based
on her own interests or based on the topical
communities.
10
INTRODUCTION – CONTRIBUTION
(1/2)
1. Propose a probabilistic graphical model, called
Topical Communities and Personal Interest model
(abbreviated as TCPI), for modeling topics and
topical communities, as well as modeling users’
topical interests and their dependency on the topical
communities in generating content.
2. Develop a sampling method to infer the model’s
parameters, and a regularization technique to bias
the model toward learning more semantically clear topical
communities. 11
INTRODUCTION – CONTRIBUTION
(2/2)
3. We apply the TCPI model to two Twitter datasets and
show that it significantly outperforms other state-
of-the-art models in modeling the tweet generation
task.
4. An empirical analysis of topics and topical
communities for the two datasets has been
conducted to demonstrate the efficacy of the TCPI
model.
12
RELATED WORKS
• The works on analyzing topics in microblogs
• Works on analyzing communities in social networks
13
RELATED WORKS – TOPIC
ANALYSIS (1/5)
• Michelson et al. first examined topical interests of
Twitter users by analyzing the named entities
mentioned in their tweets.
• Hong et al. then conducted an empirical study on
different ways of performing topic modeling on
tweets using the original LDA model and the Author-
Topic model.
• Mehrotra et al. found that grouping tweets
containing the same hashtags may lead to a
significant improvement in LDA model performance. 14
RELATED WORKS – TOPIC
ANALYSIS (2/5)
• Qiu et al. proposed to jointly model the topics of
tweets and their associated posting behaviors.
• Zhao et al. proposed the TwitterLDA topic model, which
is considered a state-of-the-art topic model for
microblogging data.
15
RELATED WORKS – TOPIC
ANALYSIS (3/5)
• TwitterLDA is a variant of LDA in which:
a) documents are formed by aggregating tweets
posted by the same user,
b) a single background topic is assumed,
c) there is only one common topic for all words in
each tweet, and
d) each word in a tweet is generated from either the
background topic or the user’s topic.
16
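As a rough illustration (not the authors' code), the following Python sketch generates one tweet under assumptions (b)–(d) above; the parameter names and the convention that the coin selects the background topic with probability mu are assumptions made here for readability.

```python
import numpy as np

def generate_tweet_twitterlda(theta_u, phi, phi_bg, mu, length, rng):
    """Sketch of TwitterLDA tweet generation.
    theta_u: user topic distribution, phi: K x W topic-word matrix,
    phi_bg: background word distribution, mu: assumed probability that a
    word is drawn from the background topic."""
    z = rng.choice(len(theta_u), p=theta_u)                    # (c) one topic per tweet
    words = []
    for _ in range(length):
        if rng.random() < mu:                                  # (d) background vs. topic word
            words.append(rng.choice(len(phi_bg), p=phi_bg))    # (b) single background topic
        else:
            words.append(rng.choice(phi.shape[1], p=phi[z]))
    return z, words

# Example: z, words = generate_tweet_twitterlda(theta_u, phi, phi_bg, 0.2, 10,
#                                               np.random.default_rng(0))
```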
RELATED WORKS – TOPIC
ANALYSIS (4/5)
17
RELATED WORKS – TOPIC
ANALYSIS (5/5)
18
• The TwitterLDA model however does not consider
multiple background topics, and impractically
assumes that all users have the same dependency on
the single background topic (as the parameter μ is
common to all users).
RELATED WORKS – COMMUNITY
ANALYSIS (1/3)
19
• Most of the early works on community analysis in
social networks focus on finding social communities
based on social links among the users.
• Newman discovered social communities by finding
a network partition that maximizes a measure of
“compactness” in community structure called
modularity.
RELATED WORKS – COMMUNITY
ANALYSIS (2/3)
20
• Airoldi et al. proposed a statistical mixed
membership model; other works find topical
communities based on user-generated content,
users’ attributes, and interest affiliations.
• Ding et al. conducted an empirical study showing
that the social community structure of a social network
may be significantly different from the topical
communities discovered from the same network.
RELATED WORKS – COMMUNITY
ANALYSIS (3/3)
21
• Most existing works do not differentiate users’
personal interests from those of topical communities,
assuming that a user’s topical interests are
determined purely by her topical communities’
interests.
• This assumption is not practical since microblog
users express interest in a vast variety of topics of
daily life, and their interests are therefore not always
determined by their topical communities.
TCPI - ASSUMPTION (1/2)
22
• Users generate content topically.
• For each user, there is always an underlying topic
explaining the content of every tweet she posts.
TCPI - ASSUMPTION (2/2)
23
• Users generate content according either to their
personal interests or to some topical communities.
• While different users generally have different
personal topical interests, their generated content
also shares some common topics of the topical
communities the users belong to.
TCPI - GENERATIVE PROCESS (1/5)
24
• Users generate tweets from a vocabulary V
• There are K latent topics
• Each topic k has a multinomial
distribution φk over the vocabulary V
• There are C topical communities
• Each community c has a multinomial
distribution σc over the K topics
• Each user u has a personal topic
distribution θu over the K topics and a
community distribution πu over the C
topical communities
TCPI - GENERATIVE PROCESS (2/5)
25
• Each user u has a dependence
distribution μu, a Bernoulli
distribution indicating how likely the
user tweets based on her own personal
interests (μu^0) or based on the topical
communities (μu^1 = 1 − μu^0).
• Lastly, we assume that θu, πu, σ, and φ
have Dirichlet priors α, τ, η, and β
respectively, while μu has a Beta prior ρ.
TCPI - GENERATIVE PROCESS (3/5)
26
• To generate a tweet t for user u:
• we first flip a biased coin yt^u (whose
bias toward heads is μu^0) to decide if the
tweet is based on u’s personal interests
or based on one of the topical
communities u belongs to.
• If yt^u = 0, we then choose the topic zt
for the tweet according to u’s topic
distribution θu.
TCPI - GENERATIVE PROCESS (4/5)
27
• If yt^u = 1, we first choose a topical
community c according to u’s
community distribution πu, then we
choose zt according to the chosen
community’s topic distribution σc.
• Assuming that each tweet has only one
topic, once the topic zt is chosen, the
words in t are chosen according to
the topic’s word distribution φzt.
TCPI - GENERATIVE PROCESS (5/5)
28
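To make the generative process concrete, the following minimal Python sketch follows the steps described on the previous slides; it is illustrative only, and the variable names (theta_u, pi_u, sigma, phi, mu_u0) simply mirror the notation above.

```python
import numpy as np

def generate_tweet_tcpi(theta_u, pi_u, sigma, phi, mu_u0, length, rng):
    """Sketch of TCPI tweet generation for one user.
    theta_u: personal topic distribution, pi_u: community distribution,
    sigma: C x K community-topic matrix, phi: K x W topic-word matrix,
    mu_u0: probability of tweeting from personal interest."""
    y = int(rng.random() >= mu_u0)                       # coin: 0 = personal, 1 = community
    if y == 0:
        z = rng.choice(len(theta_u), p=theta_u)          # topic from theta_u
        c = None
    else:
        c = rng.choice(len(pi_u), p=pi_u)                # community from pi_u
        z = rng.choice(sigma.shape[1], p=sigma[c])       # topic from sigma_c
    words = [rng.choice(phi.shape[1], p=phi[z]) for _ in range(length)]
    return y, c, z, words

# Example: y, c, z, words = generate_tweet_tcpi(theta_u, pi_u, sigma, phi,
#                                               0.7, 8, np.random.default_rng(0))
```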
TCPI - MODEL LEARNING
29
• Given a set of microblogging users together with
their posted tweets, we now present the algorithm
for performing inference in the TCPI model.
• We use U to denote the number of users and use W
to denote the number of words in the tweet
vocabulary V. We denote the set of all posted tweets
in the dataset by T.
TCPI - MODEL LEARNING
30
• For each user u_i, we denote her j-th tweet by t_j^i. For
each posted tweet t_j^i, we denote the n_j^i words in the
tweet by w_1^ij, · · · , w_{n_j^i}^ij respectively, and we
denote the tweet’s topic, coin, and topical
community (if it exists) by z_j^i, y_j^i, and c_j^i respectively.
• Lastly, we denote the bag-of-topics, bag-of-coins,
and bag-of-topical-communities of all the posted
tweets in the dataset by Z, Y, and C respectively.
TCPI - MODEL LEARNING
31
• Due to the intractability of exact inference in LDA-based models, we
make use of a sampling method for learning and
estimating the parameters of the TCPI model.
• We use a collapsed Gibbs sampler to iteratively and
jointly sample the latent coin and latent topical
community, and to sample the latent topic of every posted
tweet, as follows.
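The structure of one such iteration is sketched below, assuming hypothetical helpers for the conditional draws; the actual conditional probabilities are those of Equations 1–4 (Figures 2 and 3), and `counts` is a statistics object like the one sketched after the count definitions below.

```python
def gibbs_iteration(tweets, counts, sample_coin_and_community, sample_topic):
    """One collapsed Gibbs sweep over all posted tweets (structural sketch)."""
    for tweet in tweets:                       # every posted tweet in T
        # remove the tweet's current assignments from the counts
        counts.update(tweet.user, tweet.words, tweet.y, tweet.c, tweet.z, -1)
        # jointly resample the coin and (if the coin is 1) the community
        tweet.y, tweet.c = sample_coin_and_community(tweet, counts)
        # resample the tweet's topic
        tweet.z = sample_topic(tweet, counts)
        # add the new assignments back to the counts
        counts.update(tweet.user, tweet.words, tweet.y, tweet.c, tweet.z, +1)
```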
32
• For each posted tweet t_j^i, the j-th tweet posted by user u_i, we denote by
Y_(−t_j^i), C_(−t_j^i), and Z_(−t_j^i) the bag-of-coins, bag-of-topical-communities, and bag-of-
topics of all other posted tweets in the dataset, i.e., all tweets except t_j^i.
• The coin y_j^i and the topical community c_j^i of t_j^i are jointly sampled
according to the equations in Figure 2.
33
• ny(y, u, Y) records the number of times the coin y is observed in the
set of tweets of user u for the bag-of-coins Y.
• nzu(z, u, Z) records the number of times the topic z is observed in the
set of tweets of user u for the bag-of-topics Z.
• nzc(z, c, Z, C) records the number of times the topic z is observed in
the set of tweets that are tweeted based on the topical community c
by any user, for the bag-of-topics Z and the bag-of-topical-
communities C.
• ncu(c, u, C) records the number of times the topical community c is
observed in the set of tweets of user u for the bag-of-topical-communities C.
34
• The topic z_j^i of t_j^i is sampled according to the equations in Figure 3.
• Note that when y_j^i = 0, we do not have to sample c_j^i, and the current c_j^i
(if it exists) is discarded.
35
• nzu(z, u, Z) records the number of times the topic z is observed in the
set of tweets of user u for the bag-of-topics Z.
• nzc(z, c, Z, C) records the number of times the topic z is observed in
the set of tweets that are tweeted based on the topical community c
by any user, for the bag-of-topics Z and the bag-of-topical-
communities C.
• nw(w, z, T , Z) records the number of times the word w is observed in
the topic z for the set of tweets T and the bag-of-topics Z.
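One possible way to maintain these count statistics during sampling is sketched below; the dictionary-based counters are an implementation assumption, not the authors' code, and ncu and nzc are updated only for community-based (y = 1) tweets since the community variable exists only for those tweets.

```python
from collections import defaultdict

class TCPICounts:
    """Count statistics used by the collapsed Gibbs sampler (sketch)."""
    def __init__(self):
        self.ny = defaultdict(int)    # (y, u) -> ny(y, u, Y)
        self.ncu = defaultdict(int)   # (c, u) -> ncu(c, u, C)
        self.nzu = defaultdict(int)   # (z, u) -> nzu(z, u, Z)
        self.nzc = defaultdict(int)   # (z, c) -> nzc(z, c, Z, C)
        self.nw = defaultdict(int)    # (w, z) -> nw(w, z, T, Z)

    def update(self, u, words, y, c, z, delta):
        """delta = -1 removes a tweet's assignments, +1 adds them back."""
        self.ny[(y, u)] += delta                  # coin count for user u
        self.nzu[(z, u)] += delta                 # topic count for user u
        if y == 1:                                # community-based tweet
            self.ncu[(c, u)] += delta             # community count for user u
            self.nzc[(z, c)] += delta             # topic count within community c
        for w in words:
            self.nw[(w, z)] += delta              # word-topic count
```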
TCPI - MODEL LEARNING
36
1. the first term is proportional to the probability that the coin 0 is
generated given the priors and (current) values of all other latent
variables (i.e., the coins, topical communities (if exist), and topics of
all other tweets)
2. the second term is proportional to the probability that the (current)
topic z_j^i is generated given the priors, the (current) values of all other
latent variables, and the chosen coin.
TCPI - MODEL LEARNING
37
1. the first term is proportional to the probability that the coin 1 is generated
given the priors and (current) values of all other latent variables
2. the second term is proportional to the probability that the topical
community c is generated given the priors, (current) values of all other
latent variables, and the chosen coin
3. the third term is proportional to the probability that the (current) topic z_j^i is
generated given the priors, (current) values of all other latent variables, and
the chosen coin as well as the chosen community.
TCPI - MODEL LEARNING
38
The terms on the right hand side of Equations 3 and 4
have meanings analogous to those of the terms of
Equations 1 and 2, respectively.
TCPI - SPARSITY REGULARIZATION
(1/12)
39
• We want topical communities’ topic
distributions and users’ topic distributions to
be skewed toward different topics, and different
topical communities’ topic distributions to be
skewed toward different topics as well.
TCPI - SPARSITY REGULARIZATION
(2/12)
40
In estimating the parameters of the TCPI model, we
need to obtain sparsity in the following distributions:
• Topic-specific coin distribution p(y|z):
to ensure that each topic z is mostly covered by either
users’ personal interests or topical communities.
• Topic-specific topical community distribution p(c|z):
to ensure that each topic z is mostly covered by one or
only a few topical communities.
To obtain the sparsity mentioned above, we use the pseudo-
observed variable based regularization technique proposed
by Balasubramanyan et al. as follows.
TCPI - SPARSITY REGULARIZATION
(3/12)
41
Topic Specific Coin Distribution Regularization
Since the topic-specific coin distributions are
determined by both the coin and community joint
sampling step and the topic sampling step, we regularize
both steps to bias the distributions toward the
expected sparsity.
TCPI - SPARSITY REGULARIZATION
(4/12)
42
In Coin and Topical Community Joint Sampling Steps.
In each coin and topical community sampling step for
the tweet t_j^i, we multiply the right hand side of the
equations in Figure 2 by a corresponding
regularization term R_topCoin-C&C(y | z_j^i), which is
computed based on the empirical entropy of p(y | z_j^i) as in
Equation 5.
(Terms in Equation 5: the empirical entropy of p(y′ | z_j^i) when y_j^i = y;
the pre-defined expected mean of the entropy of p(y|z); and
the pre-defined expected variance of the entropy of p(y|z).)
TCPI - SPARSITY REGULARIZATION
(5/12)
43
In Topic Sampling Steps.
In each topic sampling step for the tweet t_j^i, we
multiply the right hand side of the equations in Figure 3
by a corresponding regularization term R_topCoin-Topic(z | t_j^i),
which is computed based on the empirical entropy of p(y|z) as in
Equation 6.
(Terms in Equation 6: the empirical entropy of p(y | z′) when z_j^i = z;
the expected mean of the entropy of p(y|z); and
the expected variance of the entropy of p(y|z).)
TCPI - SPARSITY REGULARIZATION
(6/12)
44
With a low expected mean μ_topCoin, these
regularization terms:
(1) increase the weight of values of y, c, and z that give
lower empirical entropy of p(y|z), hence
increasing the sparsity of these distributions, and
(2) decrease the weight of values of y, c, and z that give
higher empirical entropy of p(y|z), i.e., values that
would decrease the sparsity of these distributions.
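As a rough illustration of how such an entropy-based term can be computed, the sketch below assumes the Gaussian pseudo-observation form exp(−(H − μ)² / (2σ²)), a common instantiation of this regularization technique; the exact expressions used by TCPI are Equations 5–8, and the community-specific terms (Equations 7–8) would be computed analogously from the counts defining p(c|z).

```python
import math

def empirical_entropy(counts):
    """Entropy of the empirical distribution induced by a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def entropy_regularizer(counts, mu, sigma):
    """Assumed Gaussian pseudo-observation of the entropy (mean mu, scale sigma)."""
    h = empirical_entropy(counts)
    return math.exp(-((h - mu) ** 2) / (2 * sigma ** 2))

# Example: weight for p(y|z) built from the counts of y = 0 and y = 1 under topic z
# r = entropy_regularizer([n_y0_given_z, n_y1_given_z], mu=0.0, sigma=0.3)
```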
TCPI - SPARSITY REGULARIZATION
(7/12)
45
Topic Specific Topical Community Distribution
Regularization
Since the topic-specific topical community
distributions are determined by both the coin and topical
community joint sampling step and the topic sampling step,
we regularize both steps to bias the distributions toward the
expected sparsity.
TCPI - SPARSITY REGULARIZATION
(8/12)
46
In Coin and Topical Community Joint Sampling Steps.
In each coin and topical community sampling step for
the tweet t_j^i, we also multiply the right hand side of the
equations in Figure 2 by a corresponding
regularization term R_topComm-C&C(y, c | z_j^i), which is
computed based on the empirical entropy of p(c′ | z_j^i) as in
Equation 7.
(Terms in Equation 7: the empirical entropy of p(c′ | z_j^i) when
y_j^i = y and c_j^i = c; the pre-defined expected mean of the entropy of p(c|z); and
the pre-defined expected variance of the entropy of p(c|z).)
TCPI - SPARSITY REGULARIZATION
(9/12)
47
In Topic Sampling Steps.
In each topic sampling step for the tweet t_j^i, we also
multiply the right hand side of the equations in Figure 3
by a corresponding regularization term R_topComm-Topic(z | t_j^i),
which is computed based on the empirical entropy of p(c|z) as in
Equation 8.
(Terms in Equation 8: the empirical entropy of p(c | z′) when z_j^i = z;
the expected mean of the entropy of p(c|z); and
the expected variance of the entropy of p(c|z).)
TCPI - SPARSITY REGULARIZATION
(10/12)
48
With a low expected mean μ_topComm, these
regularization terms:
(1) increase the weight of values of y, c, and z that give
lower empirical entropy of p(c|z), hence
increasing the sparsity of these distributions, and
(2) decrease the weight of values of y, c, and z that give
higher empirical entropy of p(c|z), i.e., values that
would decrease the sparsity of these distributions.
TCPI - SPARSITY REGULARIZATION
(11/12)
49
Settings:
μ_topCoin = μ_topComm = 0, σ_topCoin = 0.3, σ_topComm = 0.5
Symmetric Dirichlet hyperparameters:
α = 50/K, β = 0.01, ρ = 2, τ = 1/C, and η = 50/K
TCPI - SPARSITY REGULARIZATION
(12/12)
50
We train the model with 600 iterations of Gibbs
sampling, and take 25 samples with a gap of 20 iterations
over the last 500 iterations to estimate all the hidden
variables.
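A sketch of this training schedule; gibbs_sweep and snapshot are hypothetical helpers standing in for one full sampling pass and for recording the current state, and the burn-in of 100 iterations follows from sampling only in the last 500 of the 600 iterations.

```python
def run_training(model, gibbs_sweep, snapshot,
                 total_iters=600, burn_in=100, sample_gap=20):
    """Run Gibbs sampling and collect samples for parameter estimation."""
    samples = []
    for it in range(1, total_iters + 1):
        gibbs_sweep(model)                       # one pass over all tweets
        if it > burn_in and (it - burn_in) % sample_gap == 0:
            samples.append(snapshot(model))      # 25 samples over the last 500 iterations
    return samples                               # estimates are averaged over these samples
```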
EXPERIMENTAL EVALUATION –
DATASET (1/2)
51
Both datasets are collected using snowball sampling:
SE Dataset. This dataset is collected from a set of
Twitter users who are interested in technology, and
particularly in software development, from August 1st
to October 31st, 2011.
Two-Week Dataset. A large corpus of tweets collected
just before the 2012 US presidential election, from
August 25th to September 7th, 2012.
EXPERIMENTAL EVALUATION –
DATASET (2/2)
52
1. Removed stopwords from the tweets.
2. Filtered out tweets with fewer than 3 non-stopwords.
3. Excluded users with fewer than 50 (remaining)
tweets.
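A minimal sketch of these three steps; the stopword list and the tokenization of tweets are assumptions (any standard list and tokenizer could be used).

```python
def preprocess(users_tweets, stopwords, min_words=3, min_tweets=50):
    """users_tweets: dict mapping user id -> list of token lists."""
    cleaned = {}
    for user, tweets in users_tweets.items():
        kept = []
        for tokens in tweets:
            content = [t for t in tokens if t.lower() not in stopwords]  # step 1
            if len(content) >= min_words:                                # step 2
                kept.append(content)
        if len(kept) >= min_tweets:                                      # step 3
            cleaned[user] = kept
    return cleaned
```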
EXPERIMENTAL EVALUATION -
EVALUATION METRICS
53
• We compare TCPI with the TwitterLDA model.
• We adopt likelihood and perplexity for evaluating the
two models.
• For each user, we randomly selected 90% of the user’s
tweets to form the training set, and used the
remaining 10% of the tweets as the test set.
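A sketch of this evaluation protocol: a per-user 90/10 split and perplexity computed from per-word log-likelihoods; log_prob_word is a hypothetical stand-in for the model's predictive word probability.

```python
import math
import random

def split_user_tweets(tweets, train_frac=0.9, seed=0):
    """Randomly split one user's tweets into training and test sets."""
    rnd = random.Random(seed)
    shuffled = tweets[:]
    rnd.shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]        # train, test

def perplexity(test_tweets, log_prob_word, user):
    """Perplexity over held-out tweets from per-word log-likelihoods."""
    log_lik, n_words = 0.0, 0
    for tokens in test_tweets:
        for w in tokens:
            log_lik += log_prob_word(w, user)    # log p(w | model, user)
            n_words += 1
    return math.exp(-log_lik / n_words)
```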
EXPERIMENTAL EVALUATION -
PERFORMANCE COMPARISON
54
(1) TCPI significantly
outperforms TwitterLDA in
the topic modeling task.
(2) TCPI is robust to the
number of topical
communities,
as its performance does not
change significantly as we
increase the number of
communities from 1 to 5.
EXPERIMENTAL EVALUATION -
BACKGROUND TOPICS AND TOPICAL
COMMUNITIES ANALYSIS (1/6)
55
Considering
1. both time and space complexities, and
2. that it is not practical to expect a large number of
topics falling into topical communities,
we set the number of topical communities in the TCPI
model to 3, and the number of topics in both
models to 80.
EXPERIMENTAL EVALUATION -
BACKGROUND TOPICS AND TOPICAL
COMMUNITIES ANALYSIS (2/6)
56
EXPERIMENTAL EVALUATION -
BACKGROUND TOPICS AND TOPICAL
COMMUNITIES ANALYSIS (3/6)
57
• the topic’s top words
are the words having
the highest likelihoods
given the topic
• the topic’s top tweets
are the tweets having
the lowest perplexities
given the topic
EXPERIMENTAL EVALUATION -
BACKGROUND TOPICS AND TOPICAL
COMMUNITIES ANALYSIS (4/6)
58
• The background topic found by the TwitterLDA model
is not semantically clear.
• The topical communities and their extreme topics
found by the TCPI model are both semantically clear
and reasonable.
• This agrees with the findings of Zhao et al. that people
also use Twitter for gathering and sharing useful
information relevant to their profession.
EXPERIMENTAL EVALUATION -
BACKGROUND TOPICS AND TOPICAL
COMMUNITIES ANALYSIS (5/6)
59
EXPERIMENTAL EVALUATION -
BACKGROUND TOPICS AND TOPICAL
COMMUNITIES ANALYSIS (6/6)
60
CONCLUSION
61
• We propose a novel topic model called TCPI for
simultaneously modeling topical communities and
users’ topical interests in microblogging data. Our
model differentiates users’ personal interests from
their topical communities while learning both
sets of latent factors at the same time.
• We also report experiments on two Twitter datasets
showing the effectiveness of the proposed model.
TCPI is shown to outperform TwitterLDA, another
state-of-the-art topic model for modeling tweet
generation.
CONCLUSION – FUTURE WORKS
62
• Consider the scalability of the proposed model.
Possible solutions for scaling up the model are
approximate and distributed implementations of
Gibbs sampling procedures, and stale synchronous
parallel implementations of variational inference
procedures.
• It is potentially helpful to incorporate prior
knowledge into the proposed model. Examples of
such prior knowledge are topic-indicative features
and ground-truth community labels for some users.
Q&A!
63