SoMuS 2014 Workshop
ICMR, Glasgow, Scotland, 1 April 2014
StreamGrid: Summarization of Large Scale Events
using Topic Modelling and Temporal Analysis
Manos Schinas, Symeon Papadopoulos, Yiannis Kompatsiaris, Pericles A. Mitkas
Information Technologies Institute (ITI)
Center for Research & Technology Hellas (CERTH)
Department of Electrical & Computer Engineering
Aristotle University of Thessaloniki (AUTh)
#somus2014 #2
Overview
• The Problem
• Existing Approaches
• StreamGrid
• Experimental Study
• Summary – Future Work
Event Summarization
motivation & definition
Large-scale Public Events
• Many attendees use social media
• Huge amount of event-related social content
#oscars → 4.5M tweets
#sxsw → 1.35M tweets
#SB48 (Super Bowl) → 24.9M tweets in 4 hours!
Large-scale Public Events
• Long-running events consist of several sub-events,
e.g. 10 days of Sundance Film Festival include
opening and awards ceremonies, screenings etc.
• Many aspects and entities of interest in the context
of an event e.g. films in film festivals, teams in sports
events, etc.
• Many messages can be considered spam or
non-informative.
• Redundancy due to near-duplicate messages
Event-based Summarization
Produce concise multi-document summaries for a given
event, covering its main aspects.
List of all messages → Event-based Summarizer → Set of selected messages
Related Work
Existing Approaches
Radev et al. 2004 (baseline)
• Summary consists of the messages closest to the tf∙idf
centroid of all messages
Shen et al. 2013
• Mixture model to detect sub-events at participant level
• tf∙idf centroid to find a summary of each sub-event
Chakrabarti & Punera 2011
• Hidden Markov Model to obtain a time-based segmentation
• tf∙idf centroid to find a summary of each time segment
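The centroid idea shared by these approaches can be sketched as follows. This is a minimal illustration, not the implementation of any cited paper: whitespace tokenization, the smoothed idf variant and the function names (`tfidf_vectors`, `centroid_summary`) are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(messages):
    """Return one sparse tf-idf vector (dict: term -> weight) per message."""
    docs = [Counter(m.lower().split()) for m in messages]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed idf; this particular variant is an assumption of the sketch.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid_summary(messages, length):
    """Pick the `length` messages closest to the tf-idf centroid."""
    vectors = tfidf_vectors(messages)
    centroid = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] += weight / len(vectors)
    ranked = sorted(range(len(messages)),
                    key=lambda i: cosine(vectors[i], centroid),
                    reverse=True)
    return [messages[i] for i in ranked[:length]]
```

Because every message is ranked against the same centroid, near-duplicates of a dominant message tend to be selected together, which is the redundancy problem the later slides address.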
Existing Approaches
Erkan et al. 2004 (LexRank)
• Graph-based approach to find salient sentences
• Uses centrality of each sentence in a similarity graph
• Adapted for multi-document summarization using
each message as a sentence
• Outperforms naïve centroid-based approach
Shou et al. 2013
• Online clustering algorithm to find sub-events.
• Greedy algorithm for summarization using the
LexRank score of each message.
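A rough sketch of the graph-based scoring described above, treating each message as a sentence. It is a simplified variant for illustration only: it uses plain term-frequency cosine instead of the idf-modified cosine of the original LexRank paper, and the `threshold`, `damping` and `iterations` values are assumed defaults.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c * b[t] for t, c in a.items() if t in b)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def lexrank_scores(messages, threshold=0.1, damping=0.85, iterations=50):
    """Score messages by centrality in a thresholded similarity graph."""
    bags = [Counter(m.lower().split()) for m in messages]
    n = len(bags)
    if n == 0:
        return []
    # Undirected graph: an edge links two messages that are similar enough.
    adj = [[1.0 if i != j and cosine(bags[i], bags[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    # Power iteration with damping, as in PageRank.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                scores[j] * adj[j][i] / degree[j]
                for j in range(n) if degree[j] > 0)
            for i in range(n)
        ]
    return scores
```

Messages with many similar neighbours (for example a heavily retweeted tweet) reinforce one another and receive high scores, while isolated messages keep only the baseline `(1 - damping) / n`.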
StreamGrid
approach description
StreamGrid Overview
• Find topics using Latent Dirichlet Allocation (LDA)
• Create a timeline for each topic
• Create StreamGrid structure
• Summarize using StreamGrid
Topic Modeling using LDA
• To work with very short documents (tweets), LDA
needs some kind of message pooling
• Number of topics estimation
– Minimize: (a) total perplexity for a set of test documents
and (b) average textual similarity across topics
Pooling schemes (microblog messages are merged into pooled documents):
• Time proximity
• Same author
• Same hashtag
• Textual similarity
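As an illustration of one pooling scheme, the sketch below merges messages that share a hashtag into a single pseudo-document. The regular expression, the lowercasing of tags and the separate handling of untagged messages are assumptions of this example, not details given in the slides.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")


def pool_by_hashtag(messages):
    """Merge messages sharing a hashtag into one document per hashtag.

    Returns (pools, untagged): `pools` maps a lowercased hashtag to the
    concatenated text of its messages; `untagged` lists messages that
    contain no hashtag and therefore cannot be pooled this way.
    """
    grouped = defaultdict(list)
    untagged = []
    for text in messages:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if not tags:
            untagged.append(text)
        for tag in tags:
            # A message with several hashtags contributes to several pools.
            grouped[tag].append(text)
    pools = {tag: " ".join(texts) for tag, texts in grouped.items()}
    return pools, untagged
```

The pooled documents, rather than individual tweets, would then be passed to the topic model.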
Topic Modeling using LDA
• Split documents D into D_train and D_test
• Estimate K topics over D_train
• Calculate total perplexity of D_test
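For reference, held-out perplexity is commonly defined as exp(-Σ_d log p(w_d) / Σ_d N_d), where N_d is the number of tokens in test document d. The sketch below computes it from an already estimated model; the inputs `theta` (per-document topic proportions) and `phi` (per-topic word probabilities) and the small `floor` probability for unseen words are assumptions of this example.

```python
import math


def doc_log_likelihood(tokens, theta, phi, floor=1e-12):
    """log p(tokens) = sum over words of log sum_k theta[k] * phi[k][w]."""
    return sum(
        math.log(sum(theta[k] * phi[k].get(w, floor)
                     for k in range(len(theta))))
        for w in tokens
    )


def perplexity(test_docs, thetas, phi):
    """Total perplexity of tokenized test documents under a topic model."""
    total_tokens = sum(len(doc) for doc in test_docs)
    if total_tokens == 0:
        raise ValueError("test set contains no tokens")
    total_ll = sum(doc_log_likelihood(doc, theta, phi)
                   for doc, theta in zip(test_docs, thetas))
    return math.exp(-total_ll / total_tokens)
```

Repeating the computation for several candidate values of K gives the perplexity curve from which the number of topics is chosen.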
StreamGrid Creation
• Assign each message to the topic with the highest
probability p, provided that p > p_th
(messages that do not pass the threshold are discarded as spam)
• Create StreamGrid
The StreamGrid is indexed by topic i and time interval j:
cell c(i,j) = {set of messages associated with topic i, posted
during time interval j}
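The grid construction can be sketched as a dictionary keyed by (topic, interval). The input format (timestamp in seconds, list of topic probabilities, text), the fixed interval length and the example threshold value are assumptions made for illustration.

```python
from collections import defaultdict


def build_streamgrid(messages, start, interval, p_th=0.5):
    """Group messages into cells c(i, j) keyed by (topic i, time interval j).

    `messages` is an iterable of (timestamp, topic_probs, text) tuples,
    where `topic_probs[i]` is the probability of topic i for that message.
    """
    grid = defaultdict(list)
    for timestamp, topic_probs, text in messages:
        # Topic with the highest probability for this message.
        topic = max(range(len(topic_probs)), key=topic_probs.__getitem__)
        if topic_probs[topic] <= p_th:
            continue  # below the threshold: message is discarded
        j = int((timestamp - start) // interval)
        grid[(topic, j)].append(text)
    return dict(grid)
```

For example, with `start=0` and `interval=60`, a message posted at second 65 falls into time interval 1.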
StreamGrid Creation
• For each cell c(i,j) calculate a merged tf∙idf vector u_ij
• For each term t calculate the weight:
where tf_ij(t) is the frequency of t in cell c(i,j)
• For each message m of c(i,j) calculate the weight:
StreamGrid Creation
• Detect active cells of each topic by applying peak
detection on the associated topic timeline.
• Given a topic i and a detected peak in time window
[a,b], all cells c(i,j), a < j < b, are defined as active.
• For the set of active topics A during a time interval j,
calculate a significance score:
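The step from a topic timeline to active cells can be sketched as below. The slides do not say which peak detector is used, so the rule here (message count above mean plus k standard deviations) is purely an assumption, as are the function names. A window (a, b) is reported so that the cells strictly between a and b are the active ones, matching the a < j < b condition above. The sketch does not cover the significance score.

```python
import statistics


def detect_peaks(timeline, k=1.0):
    """Return peak windows (a, b) for one topic timeline.

    `timeline[j]` is the number of messages of the topic in interval j.
    Intervals above mean + k * std form a peak; a and b are the quiet
    intervals just before and just after it (a may be -1 and b may be
    len(timeline) when the peak touches the border of the timeline).
    """
    threshold = statistics.mean(timeline) + k * statistics.pstdev(timeline)
    windows, start = [], None
    for j, count in enumerate(timeline):
        if count > threshold and start is None:
            start = j
        elif count <= threshold and start is not None:
            windows.append((start - 1, j))
            start = None
    if start is not None:
        windows.append((start - 1, len(timeline)))
    return windows


def active_intervals(windows):
    """All intervals j with a < j < b for some detected window (a, b)."""
    return sorted({j for a, b in windows for j in range(a + 1, b)})
```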
StreamGrid Creation
• To get an overall estimation of the importance of
each topic throughout the event we calculate two
measures:
Topic-time Summarization
• Our goal is the generation of a summary of an event
for an arbitrary time frame F=[x1,x2].
• The summary has to meet the following criteria:
– It covers as many aspects of the event as possible
– Redundancy due to near-duplicate messages is
minimized
• We use a greedy algorithm that selects important
messages from each active topic in F and minimizes
redundancy simultaneously.
Topic-time Summarization
• A topic i is active in F if any of the cells contained in F is active.
• The significance score of an active topic i in F is the max
significance score across all time intervals in F.
• The weight W(m,F) of a message m in F is the sum of the
weights in each time interval.
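These three aggregation rules translate almost directly into code. The data layout is an assumption of this sketch: active cells as a set of (topic, interval) pairs, significance scores as a dictionary over the same keys, per-interval message weights as a dictionary from interval to weight, and F = [x1, x2] as an inclusive range of interval indices.

```python
def active_topics(active_cells, frame):
    """Topics having at least one active cell inside frame F = (x1, x2)."""
    x1, x2 = frame
    return {i for (i, j) in active_cells if x1 <= j <= x2}


def topic_significance(significance, topic, frame):
    """Maximum significance score of `topic` over the intervals in F."""
    x1, x2 = frame
    return max((score for (i, j), score in significance.items()
                if i == topic and x1 <= j <= x2), default=0.0)


def message_weight(interval_weights, frame):
    """W(m, F): sum of the weights of message m over the intervals in F."""
    x1, x2 = frame
    return sum(w for j, w in interval_weights.items() if x1 <= j <= x2)
```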
[Figure: active topics in two example time frames, F and F′]
Topic-time Summarization: Algorithm
Input: StreamGrid, time frame F, summary length L
Output: summary set S
Get the active topics in F
For each active topic, select the highest-weight messages → Mc (candidate set)
while |S| < L do
    for each message m in Mc do
        calculate score(m)
    end for
    add the message with the highest score to S and remove it from Mc
end while
Topic-time Summarization
• The score of a message m is a combination of its
importance and of the redundancy introduced by its
selection.
• Redundancy is the average textual similarity between m
and the set of already selected messages S
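A minimal sketch of the greedy selection loop under stated assumptions: the exact scoring formula is not given in the text, so the score here is taken to be importance minus a `penalty` times redundancy, and textual similarity is approximated by Jaccard overlap of tokens. Both choices, and the extra stop condition for an exhausted candidate set, are illustrative.

```python
def jaccard(a, b):
    """Token-set overlap between two messages (stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def redundancy(message, selected):
    """Average similarity between a message and the selected set S."""
    if not selected:
        return 0.0
    return sum(jaccard(message, s) for s in selected) / len(selected)


def greedy_summary(candidates, length, penalty=1.0):
    """Greedily build a summary of at most `length` messages.

    `candidates` maps message text to its importance weight W(m, F).
    """
    pool = dict(candidates)
    summary = []
    # Stop early if the candidate set runs out before `length` is reached.
    while len(summary) < length and pool:
        best = max(pool,
                   key=lambda m: pool[m] - penalty * redundancy(m, summary))
        summary.append(best)
        del pool[best]
    return summary
```

With candidates `{"stoker premiere tonight": 1.0, "stoker premiere tonight wow": 0.9, "dirty wars screening": 0.5}` and `length=2`, the second pick is the lower-weight but non-redundant "dirty wars screening".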
Experimental Study
Sundance Film Festival 2013
Dataset & Event
Sundance Film Festival
• Two-week festival: Jan 15-30, 2013
• Data collection based on Streaming API with the
following parameters:
– hashtags: #sundance, #sundance2013, #sundancefest
– account: @sundancefest
• Total number of tweets: 201,752
• Total number of original tweets: 100,046
Topic Modeling
• Merging messages with the same hashtag gave the best
results with respect to perplexity.
• The main trend is for perplexity to decrease as K
increases.
• Average similarity between clusters stabilized for
K>200 → K = 200
Peaky & Persistent Topics
Event Timeline
[Figure: event timeline with annotated peaks: “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film; Awards ceremony]
Selected Timeslots
• Evaluate using two timeslots with high activity.
• The first time frame has a small number of very popular tweets mainly
about two films.
• The second is a more diverse set of tweets.
• A good measure of the quality of a summary is the number of films
covered.
From | To | Tweets | Description
Mon Jan 21 05:00:00 EET 2013 | Mon Jan 21 06:00:00 EET 2013 | 5755 | “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film
Sun Jan 27 03:00:00 EET 2013 | Sun Jan 27 09:00:00 EET 2013 | 9009 | Awards ceremony
Baselines
• Random Summarizer: Selects L random tweets.
• Popularity Summarizer: Selects the top L tweets
based on retweet count.
• tf∙idf Summarizer: Uses tf∙idf weight of each tweet
to select top L.
• Cluster-based Summarizer: Creates L clusters using
k-means clustering and selects the highest weighted
message of each cluster.
• LexRank Summarizer: Graph-based method that
assigns a weight to each tweet based on its adjacent
edges.
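The two simplest baselines can be sketched in a few lines. The tweet representation (a dictionary with `text` and `retweets` keys) and the optional `seed` argument are hypothetical choices for this example.

```python
import random


def random_summary(tweets, length, seed=None):
    """Random Summarizer: pick `length` tweets uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(tweets, min(length, len(tweets)))


def popularity_summary(tweets, length):
    """Popularity Summarizer: top `length` tweets by retweet count."""
    return sorted(tweets, key=lambda t: t["retweets"], reverse=True)[:length]
```

Running the popularity baseline on original tweets only, as in the experiments, avoids selecting the same retweeted text several times.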
Timeslot #1 (Stoker & Use Orally as Indicated)
Popularity-based Summarizer
• 5/10 tweets of the summary are related to the Stoker Film → Tends
to cover only a few popular aspects of the event
• Minimizes near-duplicate redundancy, as it uses only the original
tweets.
• “Use Orally as Indicated” is the second film covered in the summary
(130 RTs)
Timeslot #1 (Stoker & Use Orally as Indicated)
LexRank Summarizer
• 9/10 tweets of the summary are retweets of a tweet related to “Use
Orally as Indicated” film → A lot of redundancy
• These tweets have high degree centrality, as there are many
connections between them.
tf∙idf Summarizer
• Covers two different films (Stoker, Stuart Hall).
• Many tweets about these films.
Timeslot #1 (Stoker & Use Orally as Indicated)
StreamGrid Summarizer
• Covers five different films (The Look of Love, Dirty Wars, Before
Midnight, Kill Your Darlings, Life According to Sam)
• There are no duplicates or near-duplicates.
• “Stoker” and “Use Orally as Indicated” are not covered!
• A combination of StreamGrid Summarization and Popularity
Summarization could solve this.
Timeslot #2 (Awards Ceremony)
KPI: Number of winning films covered by the summary
• Popularity-based summarizer outperforms all other approaches:
covers 8 films that won an award that night (Afternoon Delight,
Fruitvale, The Spectacular Now, Blood Brothers, Metro Manila, Dirty
Wars, Crystal Fairy, Pussy Riot)
• StreamGrid covers 6 films (Computer Chess, Inequality for All,
Fruitvale, Afternoon Delight, In a World, American Promise).
• Only two films in common → Integrate popularity into StreamGrid
to obtain better results.
• LexRank does not cover any of the winning films, but includes this:
'The Canyons' Snubbed By Sundance Film Festival -- Lindsay Lohan
to Blame?
• tf∙idf Summarizer includes three films but none from the winning
ones!
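The KPI used on this slide can be approximated automatically, assuming a list of winning film titles is available. The sketch below uses naive case-insensitive substring matching, which is an assumption of this example and would miss abbreviated or misspelled titles; the tweets in the usage note are invented for illustration.

```python
def films_covered(summary, films):
    """Return the set of film titles mentioned in at least one tweet."""
    covered = set()
    for film in films:
        needle = film.lower()
        if any(needle in tweet.lower() for tweet in summary):
            covered.add(film)
    return covered


def coverage_kpi(summary, films):
    """Number of films from `films` covered by the summary."""
    return len(films_covered(summary, films))
```

For instance, `coverage_kpi(["Fruitvale wins the Grand Jury Prize", "Congrats to Afternoon Delight!"], ["Fruitvale", "Afternoon Delight", "Blood Brothers"])` counts two covered films.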
Multimedia Summaries
[Figure: popularity-based summary vs. StreamGrid summary]
Is there a systematic, objective way to evaluate these?
Conclusions & Future Work
Summary
• Topic modeling approach to capture automatically
the main aspects of the event from a large set of
event-related microblogging messages.
• Peak detection on each topic-related timeline to find
active moments of each topic.
• Use of active topics to select a set of representative
messages for an arbitrary time frame.
• Greedy algorithm for the selection of messages with
respect to content coverage and redundancy
reduction.
Future Work
• Real-time version of StreamGrid framework to get
summaries of evolving and continuous social
streams.
• Investigate how different topic modeling techniques
affect the produced summary.
• Find a more systematic way to evaluate summaries
(especially multimedia!).
Thank you!
Questions?
Key References
• Shou, Lidan, et al. "Sumblr: Continuous Summarization of Evolving Tweet Streams." Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013.
• Marcus, Adam, et al. "TwitInfo: Aggregating and Visualizing Microblogs for Event Exploration." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2011.
• Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
• Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization." Journal of Artificial Intelligence Research (JAIR) 22.1 (2004): 457-479.
Film show pre-production powerpoint for siteAshtonCains
 
Elite Class ➥8448380779▻ Call Girls In New Friends Colony Delhi NCR
Elite Class ➥8448380779▻ Call Girls In New Friends Colony Delhi NCRElite Class ➥8448380779▻ Call Girls In New Friends Colony Delhi NCR
Elite Class ➥8448380779▻ Call Girls In New Friends Colony Delhi NCRDelhi Call girls
 
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779Delhi Call girls
 
Interpreting the brief for the media IDY
Interpreting the brief for the media IDYInterpreting the brief for the media IDY
Interpreting the brief for the media IDYgalaxypingy
 
DickinsonSlides teeeeeeeeeeessssssssssst.pptx
DickinsonSlides teeeeeeeeeeessssssssssst.pptxDickinsonSlides teeeeeeeeeeessssssssssst.pptx
DickinsonSlides teeeeeeeeeeessssssssssst.pptxednyonat
 
Film the city investagation powerpoint :)
Film the city investagation powerpoint :)Film the city investagation powerpoint :)
Film the city investagation powerpoint :)AshtonCains
 
Social media marketing/Seo expert and digital marketing
Social media marketing/Seo expert and digital marketingSocial media marketing/Seo expert and digital marketing
Social media marketing/Seo expert and digital marketingSheikhSaifAli1
 
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptxFactors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptxvemusae
 
Lucknow 💋 Dating Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Serv...
Lucknow 💋 Dating Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Serv...Lucknow 💋 Dating Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Serv...
Lucknow 💋 Dating Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Serv...anilsa9823
 

Dernier (20)

Film show investigation powerpoint for the site
Film show investigation powerpoint for the siteFilm show investigation powerpoint for the site
Film show investigation powerpoint for the site
 
c Starting with 5000/- for Savita Escorts Service 👩🏽‍❤️‍💋‍👨🏿 8923113531 ♢ Boo...
c Starting with 5000/- for Savita Escorts Service 👩🏽‍❤️‍💋‍👨🏿 8923113531 ♢ Boo...c Starting with 5000/- for Savita Escorts Service 👩🏽‍❤️‍💋‍👨🏿 8923113531 ♢ Boo...
c Starting with 5000/- for Savita Escorts Service 👩🏽‍❤️‍💋‍👨🏿 8923113531 ♢ Boo...
 
Enjoy Night⚡Call Girls Palam Vihar Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Palam Vihar Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Palam Vihar Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Palam Vihar Gurgaon >༒8448380779 Escort Service
 
Production diary Film the city powerpoint
Production diary Film the city powerpointProduction diary Film the city powerpoint
Production diary Film the city powerpoint
 
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort ServiceVellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Vellore Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
 
Call Girls In Gurgaon Dlf pHACE 2 Women Delhi ncr
Call Girls In Gurgaon Dlf pHACE 2 Women Delhi ncrCall Girls In Gurgaon Dlf pHACE 2 Women Delhi ncr
Call Girls In Gurgaon Dlf pHACE 2 Women Delhi ncr
 
Russian Call Girls Rohini Sector 37 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Rohini Sector 37 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...Russian Call Girls Rohini Sector 37 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Rohini Sector 37 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
 
Top Call Girls In Charbagh ( Lucknow ) 🔝 8923113531 🔝 Cash Payment
Top Call Girls In Charbagh ( Lucknow  ) 🔝 8923113531 🔝  Cash PaymentTop Call Girls In Charbagh ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment
Top Call Girls In Charbagh ( Lucknow ) 🔝 8923113531 🔝 Cash Payment
 
Russian Call Girls Rohini Sector 35 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Rohini Sector 35 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...Russian Call Girls Rohini Sector 35 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Rohini Sector 35 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
 
Film show pre-production powerpoint for site
Film show pre-production powerpoint for siteFilm show pre-production powerpoint for site
Film show pre-production powerpoint for site
 
Vip Call Girls Tilak Nagar ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Tilak Nagar ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Tilak Nagar ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Tilak Nagar ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Elite Class ➥8448380779▻ Call Girls In New Friends Colony Delhi NCR
Elite Class ➥8448380779▻ Call Girls In New Friends Colony Delhi NCRElite Class ➥8448380779▻ Call Girls In New Friends Colony Delhi NCR
Elite Class ➥8448380779▻ Call Girls In New Friends Colony Delhi NCR
 
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
 
Interpreting the brief for the media IDY
Interpreting the brief for the media IDYInterpreting the brief for the media IDY
Interpreting the brief for the media IDY
 
DickinsonSlides teeeeeeeeeeessssssssssst.pptx
DickinsonSlides teeeeeeeeeeessssssssssst.pptxDickinsonSlides teeeeeeeeeeessssssssssst.pptx
DickinsonSlides teeeeeeeeeeessssssssssst.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Masudpur
Delhi  99530 vip 56974  Genuine Escort Service Call Girls in MasudpurDelhi  99530 vip 56974  Genuine Escort Service Call Girls in Masudpur
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Masudpur
 
Film the city investagation powerpoint :)
Film the city investagation powerpoint :)Film the city investagation powerpoint :)
Film the city investagation powerpoint :)
 
Social media marketing/Seo expert and digital marketing
Social media marketing/Seo expert and digital marketingSocial media marketing/Seo expert and digital marketing
Social media marketing/Seo expert and digital marketing
 
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptxFactors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
 
Lucknow 💋 Dating Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Serv...
Lucknow 💋 Dating Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Serv...Lucknow 💋 Dating Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Serv...
Lucknow 💋 Dating Call Girls Lucknow | Whatsapp No 8923113531 VIP Escorts Serv...
 

StreamGrid: Summarization of large-scale Events using Topic Modeling and Temporal Analysis

  • 1. SoMuS 2014 Workshop ICMR, Glasgow, Scotland, 1 April 2014 StreamGrid: Summarization of Large Scale Events using Topic Modelling and Temporal Analysis Manos Schinas, Symeon Papadopoulos, Yiannis Kompatsiaris, Pericles A. Mitkas Information Technologies Institute (ITI) Center for Research & Technology Hellas (CERTH) Department of Electrical & Computer Engineering Aristotle University of Thessaloniki (AUTh)
  • 2. #somus2014 #2 Overview • The Problem • Existing Approaches • StreamGrid • Experimental Study • Summary – Future Work
  • 4. Large-scale Public Events • Many attendees use social media • Huge amount of event-related social content: #oscars: 4.5M tweets; #sxsw: 1.35M tweets; #SB48 (Super Bowl): 24.9M tweets in 4 hours!
  • 5. Large-scale Public Events • Long-running events consist of several sub-events, e.g. the 10 days of the Sundance Film Festival include opening and awards ceremonies, screenings, etc. • Many aspects and entities of interest in the context of an event, e.g. films in film festivals, teams in sports events, etc. • Many messages can be considered spam or non-informative. • Redundancy due to near-duplicate messages.
  • 6. Event-based Summarization • Produce concise multi-document summaries for a given event, covering its main aspects. • Diagram: list of all messages → event-based summarizer → set of selected messages.
  • 8. Existing Approaches • Radev et al. 2004 (baseline): summary consists of the messages closest to the tf∙idf centroid of all messages. • Shen et al. 2013: mixture model to detect sub-events at participant level; tf∙idf centroid to find a summary of each sub-event. • Chakrabarti & Punera 2011: Hidden Markov Model to obtain a time-based segmentation; tf∙idf centroid to find a summary of each time segment.
  • 9. Existing Approaches • Erkan et al. 2004 (LexRank): graph-based approach to find salient sentences; uses the centrality of each sentence in a similarity graph; adapted for multi-document summarization by treating each message as a sentence; outperforms the naïve centroid-based approach. • Shou et al. 2013: online clustering algorithm to find sub-events; greedy algorithm for summarization using the LexRank score of each message.
  • 11. StreamGrid Overview • Find topics using Latent Dirichlet Allocation (LDA) • Create a timeline for each topic • Create the StreamGrid structure • Summarize using StreamGrid
  • 12. Topic Modeling using LDA • To work with very short documents (tweets), LDA needs some kind of message pooling: microblog messages are merged into longer documents. • Pooling schemes: time proximity, same author, same hashtag, textual similarity. • Estimating the number of topics: minimize (a) total perplexity for a set of test documents and (b) average textual similarity across topics.
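The "same hashtag" pooling scheme can be illustrated with a minimal sketch. The slide does not say how tweets with several hashtags, or none, are handled; here a tweet joins the pool of every hashtag it carries and untagged tweets are returned separately. Both choices, and the name `pool_by_hashtag`, are assumptions made for illustration.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")

def pool_by_hashtag(messages):
    """Group tweet texts by (lower-cased) hashtag.

    Returns (pools, untagged): pools maps hashtag -> list of texts,
    untagged lists the texts that carry no hashtag at all.
    """
    pools = defaultdict(list)
    untagged = []
    for text in messages:
        tags = sorted({tag.lower() for tag in HASHTAG.findall(text)})
        if not tags:
            untagged.append(text)
        for tag in tags:
            # Assumption: a tweet with several hashtags joins every pool.
            pools[tag].append(text)
    return dict(pools), untagged
```

Each pool can then be concatenated into one pseudo-document (for example with `" ".join(texts)`) before topic modeling.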
  • 13. Topic Modeling using LDA • Split documents D into Dtrain and Dtest • Estimate K topics over Dtrain • Calculate total perplexity of Dtest
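The slide does not spell out the perplexity formula. The sketch below uses the standard held-out definition from the LDA literature, exp(-(sum of document log-likelihoods) / (total number of tokens)), and assumes the per-document log-likelihoods of Dtest have already been obtained from a model trained on Dtrain. The function name and inputs are illustrative.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """Held-out perplexity from per-document log-likelihoods (natural log)
    and per-document token counts."""
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("test set contains no tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)
```

Lower values mean the model predicts the held-out documents better, so candidate values of K can be compared by this number.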
  • 14. StreamGrid Creation • Assign each message to the topic with the highest probability p, provided p > pth (spam messages are discarded) • Create the StreamGrid: rows are topics i, columns are time intervals j, and cell c(i,j) = {set of messages associated with topic i, posted during time interval j}
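A minimal sketch of the grid construction described on this slide. The message representation (id, timestamp in seconds, list of topic probabilities), the fixed interval length, and the name `build_grid` are assumptions; the slide only fixes the rule "highest-probability topic, kept if p > pth".

```python
from collections import defaultdict

def build_grid(messages, p_th, interval_seconds, t0=0):
    """Map (topic i, interval j) -> list of message ids.

    messages: iterable of (msg_id, timestamp, topic_probs).
    Messages whose best topic probability is not above p_th are dropped.
    """
    grid = defaultdict(list)
    for msg_id, timestamp, topic_probs in messages:
        topic, p = max(enumerate(topic_probs), key=lambda pair: pair[1])
        if p <= p_th:
            continue  # no confident topic: treated as spam / noise
        interval = int((timestamp - t0) // interval_seconds)
        grid[(topic, interval)].append(msg_id)
    return dict(grid)
```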
  • 15. StreamGrid Creation • For each cell c(i,j) calculate a merged tf∙idf vector u_ij • For each term t calculate the weight: where tf_ij(t) is the frequency of t in cell c(i,j) • For each message m of c(i,j) calculate the weight:
  • 16. StreamGrid Creation • Detect active cells of each topic by applying peak detection on the associated topic timeline. • Given a topic i and a detected peak in time window [a,b], all cells c(i,j), a < j < b, are defined as active. • For the set of active topics A during a time interval j, calculate a significance score:
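The slides do not name the peak detection method. As a stand-in, the sketch below flags intervals whose message count exceeds mean + k·std of the topic timeline and reports each run as a window (a, b), where a and b are the nearest intervals just outside the run, so that the active cells are exactly those with a < j < b as stated above. The thresholding rule, the parameter `k`, and both function names are assumptions.

```python
from statistics import mean, pstdev

def detect_peaks(counts, k=1.0):
    """Return peak windows (a, b) over a topic timeline of message counts.

    A window brackets a maximal run of intervals whose count is above
    mean + k * std; a and b themselves lie outside the run.
    """
    threshold = mean(counts) + k * pstdev(counts)
    windows, start = [], None
    for j, count in enumerate(counts):
        if count > threshold and start is None:
            start = j
        elif count <= threshold and start is not None:
            windows.append((start - 1, j))
            start = None
    if start is not None:  # run extends to the end of the timeline
        windows.append((start - 1, len(counts)))
    return windows

def active_intervals(windows):
    """All interval indices j with a < j < b for some window (a, b)."""
    return sorted({j for a, b in windows for j in range(a + 1, b)})
```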
  • 17. StreamGrid Creation • To get an overall estimation of the importance of each topic throughout the event we calculate two measures:
  • 18. Topic-time Summarization • Our goal is to generate a summary of an event for an arbitrary time frame F=[x1,x2]. • The summary has to meet the following criteria: as many aspects of the event as possible are covered; redundancy due to near-duplicate messages is minimized. • We use a greedy algorithm that selects important messages from each active topic in F while minimizing redundancy.
  • 19. Topic-time Summarization • A topic i is active in F if any of the cells contained in F is active. • The significance score of an active topic i in F is the max significance score across all time intervals in F. • The weight W(m,F) of a message m in F is the sum of the weights in each time interval. • Figure: two time frames, F and F’, each with its set of active topics.
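The three aggregation rules on this slide can be sketched as follows. The data layout is assumed for illustration: active cells as a set of (topic, interval) pairs, significance scores keyed by (topic, interval), per-interval message weights keyed by (message, interval), and a frame F given as an inclusive pair of interval indices (x1, x2).

```python
def in_frame(j, frame):
    x1, x2 = frame
    return x1 <= j <= x2

def active_topics(active_cells, frame):
    """Topics with at least one active cell (i, j) inside the frame."""
    return sorted({i for i, j in active_cells if in_frame(j, frame)})

def topic_significance(significance, topic, frame):
    """Max significance score of the topic over the intervals in the frame."""
    return max(score for (i, j), score in significance.items()
               if i == topic and in_frame(j, frame))

def message_weight(weights, message, frame):
    """W(m, F): sum of the message's per-interval weights inside the frame."""
    return sum(w for (m, j), w in weights.items()
               if m == message and in_frame(j, frame))
```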
  • 20. Topic-time Summarization: Algorithm
      Input: StreamGrid, time frame F, summary length L
      Output: summary set S
      1. Get active topics in F
      2. for each active topic select message with highest weight Mc
      3. while |S| < L do
      4.   for each message m in Mc do
      5.     calculate score(m)
      6.   end for
      7.   Add message with highest score to S and remove it from Mc
      8. end while
  • 21. Topic-time Summarization • The score of a message m is a combination of its importance and of the redundancy introduced by its selection. • Redundancy is the average textual similarity among the set of already selected messages S
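A rough sketch of the greedy selection loop. The exact scoring formula is not preserved in the transcript, so several pieces here are assumptions: the score is taken as weight minus `lam` times redundancy, redundancy is read as the average similarity between the candidate and the messages already in S, and Jaccard overlap of lower-cased tokens stands in for the textual similarity measure. The candidate pool Mc is passed in as a mapping from message text to its weight W(m,F).

```python
def jaccard(a, b):
    """Token-set overlap; a stand-in for the textual similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def redundancy(text, selected):
    """Average similarity between a candidate and the selected messages."""
    if not selected:
        return 0.0
    return sum(jaccard(text, s) for s in selected) / len(selected)

def greedy_summary(candidates, length, lam=1.0):
    """Pick up to `length` messages, trading weight against redundancy.

    candidates: dict mapping message text -> weight.
    """
    pool = dict(candidates)
    summary = []
    while pool and len(summary) < length:
        # Assumed score: importance minus lam * redundancy.
        best = max(pool, key=lambda t: pool[t] - lam * redundancy(t, summary))
        summary.append(best)
        del pool[best]
    return summary
```

With this scoring, a near-duplicate of an already selected message loses out to a lower-weight message on a different topic, which is the behaviour the slide describes.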
  • 23. Dataset & Event: Sundance Film Festival • Two-week festival: Jan 15-30, 2013 • Data collection based on the Streaming API with the following parameters: hashtags #sundance, #sundance2013, #sundancefest; account @sundancefest • Total number of tweets: 201,752 • Total number of original tweets: 100,046
  • 24. Topic Modeling • Merging messages with the same hashtag gave the best results with respect to perplexity. • The main trend is for perplexity to decrease as K increases. • Average similarity between clusters stabilized for K>200 → K = 200
  • 26. Event Timeline • Annotated peaks: “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film; awards ceremony.
  • 27. Selected Timeslots • Evaluate using two timeslots with high activity. • The first time frame has a small number of very popular tweets, mainly about two films. • The second is a more diverse set of tweets. • A good measure of the quality of a summary is the number of films covered. • Timeslot 1: Mon Jan 21 05:00:00 EET 2013 to Mon Jan 21 06:00:00 EET 2013, 5755 tweets (“Stoker” film by Chan-wook Park & “Use Orally as Indicated” film). • Timeslot 2: Sun Jan 27 03:00:00 EET 2013 to Sun Jan 27 09:00:00 EET 2013, 9009 tweets (awards ceremony).
  • 28. Baselines • Random Summarizer: selects L random tweets. • Popularity Summarizer: selects the top L tweets based on retweet count. • tf∙idf Summarizer: uses the tf∙idf weight of each tweet to select the top L. • Cluster-based Summarizer: creates L clusters using k-means clustering and selects the highest-weighted message of each cluster. • LexRank Summarizer: graph-based method that assigns a weight to each tweet based on its adjacent edges.
  • 29. Timeslot #1 (Stoker & Use Orally as Indicated): Popularity-based Summarizer • 5/10 tweets of the summary are related to the Stoker film → tends to cover only a few popular aspects of the event • Minimizes near-duplicate redundancy, as it uses only the original tweets. • “Use Orally as Indicated” is the second film covered in the summary (130 RTs)
  • 30. Timeslot #1 (Stoker & Use Orally as Indicated) • LexRank Summarizer: 9/10 tweets of the summary are retweets of a tweet related to the “Use Orally as Indicated” film → a lot of redundancy. These tweets have high degree centrality, as there are many connections between them. • tf∙idf Summarizer: covers two different films (Stoker, Stuart Hall). Many tweets about these films.
  • 31. Timeslot #1 (Stoker & Use Orally as Indicated): StreamGrid Summarizer • Covers five different films (The Look of Love, Dirty Wars, Before Midnight, Kill Your Darlings, Life According to Sam) • There are no duplicates or near-duplicates. • “Stoker” and “Use Orally as Indicated” are not covered! • A combination of StreamGrid Summarization and Popularity Summarization could solve this.
  • 32. Timeslot #2 (Awards Ceremony) • KPI: number of winning films covered by the summary • Popularity-based summarizer outperforms all other approaches: covers 8 films that won an award that night (Afternoon Delight, Fruitvale, The Spectacular Now, Blood Brothers, Metro Manila, Dirty Wars, Crystal Fairy, Pussy Riot) • StreamGrid covers 6 films (Computer Chess, Inequality for All, Fruitvale, Afternoon Delight, In a World, American Promise). • Only two films in common → integrate popularity into StreamGrid to obtain better results. • LexRank does not cover any of the winning films, but includes this: 'The Canyons' Snubbed By Sundance Film Festival -- Lindsay Lohan to Blame? • tf∙idf Summarizer includes three films, but none of the winning ones!
  • 33. Multimedia Summaries • Figure: popularity-based summary next to StreamGrid summary. • Is there any systematic, objective way to evaluate these?
  • 35. Summary • Topic modeling approach to automatically capture the main aspects of the event from a large set of event-related microblogging messages. • Peak detection on each topic-related timeline to find the active moments of each topic. • Use of active topics to select a set of representative messages for an arbitrary time frame. • Greedy algorithm for the selection of messages with respect to content coverage and redundancy reduction.
  • 36. Future Work • Real-time version of the StreamGrid framework to get summaries of evolving and continuous social streams. • Investigate how different topic modeling techniques affect the produced summary. • Find a more systematic way to evaluate summaries (especially multimedia!).
  • 38. Key References • Shou, Lidan, et al. "Sumblr: Continuous summarization of evolving tweet streams." Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013. • Marcus, Adam, et al. "Twitinfo: Aggregating and visualizing microblogs for event exploration." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2011. • Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022. • Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research 22.1 (2004): 457-479.