StreamGrid: Summarization of large-scale Events using Topic Modeling and Temporal Analysis

SoMuS 2014 Workshop
ICMR, Glasgow, Scotland, 1 April 2014
StreamGrid: Summarization of Large Scale Events
using Topic Modelling and Temporal Analysis
Manos Schinas, Symeon Papadopoulos, Yiannis Kompatsiaris, Pericles A. Mitkas
Information Technologies Institute (ITI)
Center for Research & Technology Hellas (CERTH)
Department of Electrical & Computer Engineering
Aristotle University of Thessaloniki (AUTh)

#somus2014 #2
Overview
• The Problem
• Existing Approaches
• StreamGrid
• Experimental Study
• Summary – Future Work

Event Summarization
motivation & definition

Large-scale Public Events
• Many attendees use social media
• Huge amount of event-related social content
#oscars: 4.5M tweets
#sxsw: 1.35M tweets
#SB48 (Super Bowl): 24.9M tweets in 4 hours!

Large-scale Public Events
• Long-running events consist of several sub-events,
e.g. 10 days of Sundance Film Festival include
opening and awards ceremonies, screenings etc.
• Many aspects and entities of interest in the context
of an event e.g. films in film festivals, teams in sports
events, etc.
• Many messages can be considered spam or non-informative.
• Redundancy due to near-duplicate messages

Event-based Summarization
Produce concise multi-document summaries for a given
event, covering its main aspects.
[Diagram: list of all messages → event-based summarizer → set of selected messages]

Existing Approaches
Radev et al. 2004 (baseline)
• Summary consists of the messages closest to the tf∙idf
centroid of all messages
Shen et al. 2013
• Mixture model to detect sub-events at participant level
• tf∙idf centroid to find a summary of each sub-event
Chakrabarti & Punera 2011
• Hidden Markov Model to obtain a time-based segmentation
• tf∙idf centroid to find a summary of each time segment
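The centroid idea shared by these approaches can be sketched as below. This is a minimal illustration rather than the code used in any of the cited works: the whitespace tokenizer, the smoothed idf variant, and the names `tfidf_vectors`, `cosine` and `centroid_summary` are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(messages):
    """Return one sparse tf-idf vector (dict: term -> weight) per message."""
    docs = [Counter(m.lower().split()) for m in messages]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed idf; this particular variant is an assumption.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid_summary(messages, length):
    """Select the `length` messages closest to the tf-idf centroid."""
    vectors = tfidf_vectors(messages)
    centroid = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] += weight / len(vectors)
    ranked = sorted(range(len(messages)),
                    key=lambda i: cosine(vectors[i], centroid),
                    reverse=True)
    return [messages[i] for i in ranked[:length]]
```

Because every message is ranked against the same centroid, near-duplicates of a central message all score highly, which is one reason the later slides treat redundancy separately.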

Existing Approaches
Erkan & Radev 2004 (LexRank)
• Graph-based approach to find salient sentences
• Uses centrality of each sentence in a similarity graph
• Adapted for multi-document summarization using
each message as a sentence
• Outperforms naïve centroid-based approach
Shou et al. 2013 (Sumblr)
• Online clustering algorithm to find sub-events.
• Greedy algorithm for summarization using the
LexRank score of each message.
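The graph-based centrality behind LexRank can be sketched as a damped power iteration over a thresholded similarity graph. This is a simplified illustration under stated assumptions: it uses raw term-frequency cosine instead of the idf-weighted cosine of the original paper, and the `threshold`, `damping` and `iterations` defaults are arbitrary choices for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexrank(messages, threshold=0.1, damping=0.85, iterations=50):
    """Return one centrality score per message (scores sum to 1)."""
    bags = [Counter(m.lower().split()) for m in messages]
    n = len(bags)
    if n == 0:
        return []
    # Unweighted graph: an edge wherever similarity exceeds the threshold.
    # Non-empty messages get a self-loop because cosine(x, x) = 1.
    adj = [[1.0 if cosine(bags[i], bags[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each node j spreads its score evenly over its neighbours.
        scores = [
            (1 - damping) / n
            + damping * sum(adj[j][i] / degree[j] * scores[j]
                            for j in range(n) if degree[j])
            for i in range(n)
        ]
    return scores
```

A message connected to many others ends up with the highest score, so a block of retweets of the same tweet tends to reinforce itself.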

StreamGrid
approach description

StreamGrid Overview
• Find topics using Latent Dirichlet Allocation (LDA)
• Create a timeline for each topic
• Create StreamGrid structure
• Summarize using StreamGrid

Topic Modeling using LDA
• To work with very short documents (tweets), LDA
needs some kind of message pooling
• Number of topics estimation
– Minimize: (a) total perplexity for a set of test documents
and (b) average textual similarity across topics
Pooling schemes:
• Time proximity
• Same author
• Same hashtag
• Textual similarity
[Diagram: microblog messages → merge → merged messages]
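As an illustration of one pooling scheme, the sketch below merges messages that share a hashtag into a single pseudo-document that could then be passed to LDA. The regular expression, the lowercase normalization, and the decision to return untagged messages separately are assumptions for the example; the slides do not specify these details.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#\w+")

def pool_by_hashtag(messages):
    """Merge messages sharing a hashtag into one document per hashtag.

    Returns (pooled, unpooled): a dict hashtag -> merged text, and the
    list of messages that carry no hashtag at all.
    """
    pools = defaultdict(list)
    unpooled = []
    for text in messages:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if not tags:
            unpooled.append(text)
        for tag in tags:
            # A message with several hashtags joins several pools.
            pools[tag].append(text)
    pooled = {tag: " ".join(texts) for tag, texts in pools.items()}
    return pooled, unpooled
```

With a hashtag-based collection, the tracked hashtags themselves appear in nearly every message, so in practice they would probably be excluded from the pooling keys.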

Topic Modeling using LDA
• Split the documents D into D_train and D_test
• Estimate K topics over D_train
• Calculate the total perplexity of D_test
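The perplexity step can be sketched as follows, assuming the trained model has already produced topic-word distributions and a topic mixture for each held-out document. The function name `perplexity`, the argument layout, and the probability `floor` for unseen words are assumptions for the example; an LDA library would normally infer the mixtures itself.

```python
import math

def perplexity(test_docs, doc_topics, topic_words, floor=1e-12):
    """Perplexity = exp(-(total token log-likelihood) / (number of tokens)).

    test_docs:   list of token lists (the held-out documents)
    doc_topics:  per-document topic mixtures, doc_topics[d][k] = p(topic k | d)
    topic_words: per-topic word distributions, topic_words[k][w] = p(w | topic k)
    """
    log_likelihood = 0.0
    token_count = 0
    for tokens, theta in zip(test_docs, doc_topics):
        for word in tokens:
            # p(w | d) = sum over topics of p(k | d) * p(w | k)
            p = sum(theta[k] * topic_words[k].get(word, 0.0)
                    for k in range(len(topic_words)))
            log_likelihood += math.log(max(p, floor))
            token_count += 1
    if token_count == 0:
        raise ValueError("no tokens in the test documents")
    return math.exp(-log_likelihood / token_count)
```

Lower values mean the model predicts the held-out text better, which is why the number of topics K is chosen to keep this quantity small.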

StreamGrid Creation
• Assign each message to its most probable topic, provided that the probability p exceeds a threshold p_th; messages with no topic above the threshold (e.g. spam) are discarded
• Create the StreamGrid, a grid indexed by topic i and time interval j:
cell c(i,j) = {set of messages associated with topic i, posted during time interval j}
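The grid construction can be sketched as a dictionary keyed by (topic, interval). The input layout, the function name `build_stream_grid`, and the default values for `p_th` and `interval_seconds` are assumptions for the example and are not taken from the slides.

```python
from collections import defaultdict

def build_stream_grid(messages, p_th=0.5, interval_seconds=3600):
    """Group messages into cells c(i, j) of a topic-by-time grid.

    messages: iterable of (text, unix_timestamp, topic_probabilities).
    Returns a dict mapping (topic i, interval j) -> list of texts.
    """
    grid = defaultdict(list)
    for text, timestamp, probs in messages:
        # Most probable topic of this message.
        topic = max(range(len(probs)), key=probs.__getitem__)
        if probs[topic] <= p_th:
            continue  # no topic is probable enough: discard the message
        interval = int(timestamp // interval_seconds)
        grid[(topic, interval)].append(text)
    return dict(grid)
```

Storing only non-empty cells keeps the structure sparse, which matters when there are a few hundred topics and most of them are silent in most intervals.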

StreamGrid Creation
• For each cell c(i,j) calculate a merged tf∙idf vector uij
• For each term t calculate the weight:
where tfij(t) is the frequency of t in cell c(i,j)
• For each message m of c(i,j) calculate the weight:

StreamGrid Creation
• Detect active cells of each topic by applying peak
detection on the associated topic timeline.
• Given a topic i and a detected peak in time window
[a,b], all cells c(i,j), a < j < b, are defined as active.
• For the set of active topics A during a time interval j,
calculate a significance score:
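The slides do not name the peak detection method, so the sketch below substitutes a simple rule as an assumption: an interval belongs to a peak when its message count exceeds the timeline mean plus `k` standard deviations. Windows are returned as (a, b) with the bursty intervals strictly between a and b, matching the a < j < b convention above. The significance score is not reproduced here because its formula is not given.

```python
import statistics

def detect_peaks(timeline, k=1.0):
    """Return windows (a, b); every interval j with a < j < b is above
    mean + k * stdev of the timeline (a hypothetical thresholding rule)."""
    threshold = statistics.mean(timeline) + k * statistics.pstdev(timeline)
    windows, start = [], None
    for j, count in enumerate(timeline):
        if count > threshold and start is None:
            start = j - 1                 # a: last interval before the burst
        elif count <= threshold and start is not None:
            windows.append((start, j))    # b: first interval after the burst
            start = None
    if start is not None:                 # burst runs to the end of the timeline
        windows.append((start, len(timeline)))
    return windows

def active_intervals(timeline, k=1.0):
    """Intervals j of one topic timeline whose cells c(i, j) count as active."""
    return {j for a, b in detect_peaks(timeline, k) for j in range(a + 1, b)}
```

Here `timeline` is the per-interval message count of a single topic, i.e. one row of the grid.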

StreamGrid Creation
• To get an overall estimation of the importance of
each topic throughout the event we calculate two
measures:

Topic-time Summarization
• Our goal is the generation of a summary of an event
for an arbitrary time frame F=[x1,x2].
• The summary has to meet the following criteria:
– Cover as many aspects of the event as possible
– Minimize redundancy due to near-duplicate messages
• We use a greedy algorithm that selects important
messages from each active topic in F and minimizes
redundancy simultaneously.

• A topic i is active in F if any of the cells contained in F is active.
• The significance score of an active topic i in F is the max
significance score across all time intervals in F.
• The weight W(m,F) of a message m in F is the sum of the
weights in each time interval.
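The three rules above translate directly into small aggregation helpers. In this sketch the time frame F is given as inclusive interval indices `x1` and `x2`, and the per-interval significance scores and message weights are assumed to be precomputed; the data layouts and function names are hypothetical.

```python
def active_topics_in_frame(active_cells, x1, x2):
    """Topics with at least one active cell c(i, j), x1 <= j <= x2.

    active_cells: set of (topic i, interval j) pairs.
    """
    return {i for (i, j) in active_cells if x1 <= j <= x2}

def topic_significance_in_frame(significance, topic, x1, x2):
    """Maximum significance score of a topic over the intervals of F.

    significance: dict (topic i, interval j) -> score; missing cells count as 0.
    """
    return max(significance.get((topic, j), 0.0) for j in range(x1, x2 + 1))

def message_weight_in_frame(weights_by_interval, x1, x2):
    """W(m, F): sum of a message's per-interval weights inside F.

    weights_by_interval: dict interval j -> weight of the message in that interval.
    """
    return sum(w for j, w in weights_by_interval.items() if x1 <= j <= x2)
```

Because all three functions only read precomputed cell values, a summary for a different frame can be requested without rebuilding the grid.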
[Figure: two time frames, F and F’, each shown with its set of active topics]

Topic-time Summarization: Algorithm
Input: StreamGrid, time frame F, summary length L
Output: summary set S (initially empty)
1. Get the active topics in F
2. For each active topic, select the highest-weighted messages to form the candidate set Mc
3. while |S| < L and Mc is not empty do
4.   for each message m in Mc do
5.     calculate score(m)
6.   end for
7.   Add the message with the highest score to S and remove it from Mc
8. end while
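The selection loop can be sketched in Python as below. The scoring function is passed in as a parameter because its exact form is not given at this point; the name `greedy_summary` and the extra stop condition for an exhausted candidate set are assumptions added so that the loop always terminates.

```python
def greedy_summary(candidates, score, length):
    """Greedily pick up to `length` messages from the candidate set Mc.

    candidates: list of candidate messages (Mc)
    score:      callable score(message, selected) -> float, re-evaluated on
                every round because it depends on what is already selected
    """
    remaining = list(candidates)
    selected = []
    while len(selected) < length and remaining:
        best = max(remaining, key=lambda m: score(m, selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Re-scoring every remaining candidate on each round is what lets a redundancy penalty take effect: a message that looked strong at the start can drop once a similar one has been selected.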

• The score of a message m is a combination of its
importance and of the redundancy introduced by its
selection.
• Redundancy is the average textual similarity between m and the already selected messages in S
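One way such a score could look is sketched below. The slides state only that importance and redundancy are combined, so the linear form `weight - trade_off * redundancy`, the `trade_off` parameter, and the use of Jaccard word overlap as the textual similarity are all assumptions for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two messages (stand-in for the
    unspecified textual similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def redundancy(message, selected):
    """Average similarity between a message and the already selected set S."""
    if not selected:
        return 0.0
    return sum(jaccard(message, s) for s in selected) / len(selected)

def score(message, weight, selected, trade_off=1.0):
    """Importance minus a redundancy penalty (hypothetical combination)."""
    return weight - trade_off * redundancy(message, selected)
```

With an empty summary the score equals the message weight, so the first pick is simply the most important candidate; later picks are pushed towards messages unlike those already chosen.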

Experimental Study
Sundance Film Festival 2013

Dataset & Event
Sundance Film Festival
• Two-week festival: Jan 15-30, 2013
• Data collected through the Twitter Streaming API with the following parameters:
– hashtags: #sundance, #sundance2013, #sundancefest
– account: @sundancefest
• Total number of tweets: 201,752
• Total number of original tweets: 100,046

Topic Modeling
• Merging messages that share a hashtag gave the best results with respect to perplexity.
• The main trend is for perplexity to decrease as K increases.
• Average similarity between topics stabilized for K > 200 → K = 200

Peaky & Persistent Topics

Event Timeline
[Figure: event timeline. Annotated peaks: the “Stoker” film by Chan-wook Park & the “Use Orally as Indicated” film; the awards ceremony]

Selected Timeslots
• Evaluate using two timeslots with high activity.
• The first time frame has a small number of very popular tweets mainly
about two films.
• The second is a more diverse set of tweets.
• A good measure of the quality of a summary is the number of films
covered.
From | To | Tweets | Description
Mon Jan 21 05:00:00 EET 2013 | Mon Jan 21 06:00:00 EET 2013 | 5755 | “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film
Sun Jan 27 03:00:00 EET 2013 | Sun Jan 27 09:00:00 EET 2013 | 9009 | Awards ceremony

Baselines
• Random Summarizer: Selects L random tweets.
• Popularity Summarizer: Selects the top L tweets
based on retweet count.
• tf∙idf Summarizer: Uses tf∙idf weight of each tweet
to select top L.
• Cluster-based Summarizer: Creates L clusters using
k-means clustering and selects the highest weighted
message of each cluster.
• LexRank Summarizer: Graph-based method that assigns a weight to each tweet based on its adjacent edges.

Timeslot #1 (Stoker & Use Orally as Indicated)
Popularity-based Summarizer
• 5/10 tweets of the summary are related to the Stoker film → Tends to cover only a few popular aspects of the event
• Minimizes near-duplicate redundancy, as it uses only the original tweets.
• “Use Orally as Indicated” is the second film covered in the summary (130 RTs)

LexRank Summarizer
• 9/10 tweets of the summary are retweets of a tweet related to “Use
Orally as Indicated” film → A lot of redundancy
• These tweets have high degree centrality, as there are many
connections between them.
tf∙idf Summarizer
• Covers two different films (Stoker, Stuart Hall).
• Many tweets about these films.

StreamGrid Summarizer
• Covers five different films (The Look of Love, Dirty Wars, Before Midnight, Kill Your Darlings, Life According to Sam)
• There are no duplicates or near-duplicates.
• “Stoker” and “Use Orally as Indicated” are not covered!
• A combination of StreamGrid Summarization and Popularity
Summarization could solve this.

Timeslot #2 (Awards Ceremony)
KPI: Number of winning films covered by the summary
• Popularity-based summarizer outperforms all other approaches: covers 8 films that won an award that night (Afternoon Delight, Fruitvale, The Spectacular Now, Blood Brothers, Metro Manila, Dirty Wars, Crystal Fairy, Pussy Riot)
• StreamGrid covers 6 films (Computer Chess, Inequality for All, Fruitvale, Afternoon Delight, In a World, American Promise).
• Only two films in common → Integrate popularity into StreamGrid to obtain better results.
• LexRank does not cover any of the winning films, but includes this: 'The Canyons' Snubbed By Sundance Film Festival -- Lindsay Lohan to Blame?
• tf∙idf Summarizer covers three films, but none of the winning ones!
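The coverage KPI used on this slide can be approximated automatically, as sketched below. How coverage was actually counted for the slides is not stated, so the case-insensitive substring match and the function name `films_covered` are assumptions; titles written as hashtags or abbreviations would be missed by this simple rule.

```python
def films_covered(summary, film_titles):
    """Return the set of film titles mentioned in at least one summary message.

    summary:     list of message texts selected by a summarizer
    film_titles: list of titles to look for (e.g. the award winners)
    """
    text = " ".join(summary).lower()
    return {title for title in film_titles if title.lower() in text}
```

The KPI value is then `len(films_covered(summary, winners))`, and intersecting the sets returned for two summarizers shows how many films they have in common.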

Multimedia Summaries
[Figures: popularity-based summary; StreamGrid summary]
Is there a systematic, objective way to evaluate these?

Conclusions & Future Work

Summary
• Topic modeling approach to capture automatically
the main aspects of the event from a large set of
event-related microblogging messages.
• Peak detection on each topic-related timeline to find
active moments of each topic.
• Use of active topics to select a set of representative
messages for an arbitrary time frame.
• Greedy algorithm for the selection of messages with
respect to content coverage and redundancy
reduction.

Future Work
• Real-time version of StreamGrid framework to get
summaries of evolving and continuous social
streams.
• Investigate how different topic modeling techniques
affect the produced summary.
• Find a more systematic way to evaluate summaries
(especially multimedia!).

Thank you!
Questions?

Key References
• Shou, Lidan, et al. "Sumblr: Continuous summarization of evolving tweet streams." Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013.
• Marcus, Adam, et al. "TwitInfo: Aggregating and visualizing microblogs for event exploration." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2011.
• Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
• Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research 22.1 (2004): 457-479.
