SURFACING REAL-WORLD
EVENT CONTENT ON TWITTER
Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)
Event Content in Social Media
Smaller events, without traditional news coverage
Popular, widely known events
Event Content in Social Media
 Discovery
 Detect events using features of social media content (e.g.,
term statistics)
 Mining content from known event sources (e.g., user-
contributed event databases)
 Organization
 Associating social media content with events
 Identifying similar content within and across sites
 Presentation
 Selecting what content to display to a user
 Providing interfaces that summarize and aggregate the
content along different dimensions
Identifying Events in Social Media
 Timeliness
 Real-time
 Retrospective
 (Prospective)
 Content discovery
 Known properties
 Event databases (e.g., Upcoming, Eventful)
 Keyword triggers (e.g., “earthquake”)
 Shared calendars
 Unknown properties
Identifying Events in Social Media
Timeliness: Real-time, Retrospective
Content discovery: Known, Unknown
 Earthquake prediction using Twitter [Sakaki et al. WWW’10]
 Twitter new event detection [Petrović et al. NAACL’10]
 Event detection on Flickr [Chen and Roy CIKM’09]
 Organization of YouTube concert videos [Kennedy and Naaman WWW’09]
Identifying Events in Social Media
Timeliness: Real-time, Retrospective
Content discovery: Known, Unknown
 Learning similarity metrics for event identification on Flickr [Becker et al. WSDM’10]
 Surfacing events on Twitter
 Identifying Twitter content for planned events
 Connecting events across sites (e.g., YouTube, Picasa)
Twitter Content
 Streams of textual
messages
 Brief content (140
characters)
 Communicated to network
of followers
Twitter Trending Topics
Twitter trending topics, September 24, 2010 7:00am
 Recurring
 Twitter-centric
 Confusing
 Real-World Events?
Identifying Events on Twitter
 Challenges:
 Wide variety of topics, not all related to events (e.g.,
morning greetings, “thank you” messages)
 Low quality text: abbreviations, unconventional language,
riddled with typos, grammatically incorrect
 Opportunities:
 Content generated in real-time as events happen
 Time and location information
Events on Twitter
 Types of events on Twitter
 Exogenous: Real-world occurrences (e.g., Superbowl,
“Lost” finale)
 Endogenous: Specific to the Twitter-verse (e.g.,
#thingsyoushouldntsay meme, RT statement by Lady
Gaga)
 Event:
 One or more terms and a time period
 Volume of messages posted for the terms in the time
period exceeds some expected level of activity
Real-Time Unsupervised Event
Identification on Twitter
 Organization
 Content representation: text, time, location
 Group similar content via clustering
 Discovery
 Extract discriminating features of clusters
 Build an event classifier
 Presentation
 Select content for each event
 Evaluate the quality, relevance, and usefulness
Surfacing Event Content on Twitter
[Diagram: Tweets → Tweet Clusters → Event Clusters → Selected Tweets]
Organizing Tweets in Real-Time
 Order tweets by post time
 Use TF-IDF vector representation of textual content
 Stop word elimination
 Stemming
 Enhanced weight for hashtags (#tag)
 IDF computed over past data
 Separate tweets by location
 Focus on tweets from NYC
 Different locations can be processed in parallel
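The representation above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop list, the `HASHTAG_BOOST` value, the smoothed IDF formula, and all function names are assumptions, and stemming is left out to keep the sketch dependency-free.

```python
import math
import re
from collections import Counter

# Assumed values for illustration; the slides give neither a stop list nor a boost factor.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "at", "is", "and", "for"}
HASHTAG_BOOST = 2.0

TOKEN_RE = re.compile(r"#?\w+")


def tokenize(text):
    """Lowercase, split into word and hashtag tokens, and drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


def build_idf_stats(past_tweets):
    """Collect document frequencies over past data."""
    doc_freq = Counter()
    for text in past_tweets:
        doc_freq.update(set(tokenize(text)))
    return {"n": len(past_tweets), "df": doc_freq}


def idf(term, stats):
    """Smoothed IDF, so terms unseen in the past data still get a finite weight."""
    return math.log((1 + stats["n"]) / (1 + stats["df"].get(term, 0))) + 1.0


def tfidf_vector(text, stats):
    """Unit-length sparse TF-IDF vector with extra weight on hashtags."""
    vec = {}
    for term, tf in Counter(tokenize(text)).items():
        weight = tf * idf(term, stats)
        if term.startswith("#"):
            weight *= HASHTAG_BOOST
        vec[term] = weight
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec
```

Because the IDF statistics come only from past tweets, each incoming tweet can be vectorized as it arrives, and tweets from different locations can be handled by separate workers.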
Clustering Algorithm
 Many alternatives possible! [Berkhin 2002]
 Single-pass incremental clustering algorithm
 Scalable, online solution
 Used effectively for
 Event identification in textual news [Allan et al. 1998]
 News event detection on Twitter [Sankaranarayanan et al. 2009]
 Does not require a priori knowledge of number of
clusters
 Known fragmentation issue, often solved with a
periodic second pass
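A single-pass incremental clusterer of the kind listed above might look like the sketch below. The cosine measure, the `threshold` value, and the class and function names are assumptions for illustration; the periodic second pass that addresses fragmentation is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Cluster:
    def __init__(self):
        self.members = []
        self.total = {}  # running sum of member vectors

    def add(self, doc_id, vec):
        self.members.append(doc_id)
        for t, w in vec.items():
            self.total[t] = self.total.get(t, 0.0) + w

    def centroid(self):
        n = len(self.members)
        return {t: w / n for t, w in self.total.items()}


def single_pass_cluster(stream, threshold=0.5):
    """Assign each document to its most similar cluster, or open a new one."""
    clusters = []
    for doc_id, vec in stream:  # stream is assumed to be ordered by post time
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c.centroid())
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None or best_sim < threshold:
            best = Cluster()
            clusters.append(best)
        best.add(doc_id, vec)
    return clusters
```

Each document is seen once and the number of clusters is never fixed in advance, which is what makes this approach attractive for an online setting.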
Overview of Cluster-based Approach
 Group similar tweets via online clustering
 Compute statistics of cluster content
 Top terms (e.g., [earthquake, haiti])
 Number of documents per hour
 …
 Use cluster-level features to identify event clusters
 Single feature with threshold (e.g., increase in volume
over time-window)
 Trained classification model
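The single-feature option above (increase in volume over a time window) could be sketched like this. The window length, the threshold, the add-one smoothing of the baseline, and the function names are all assumptions, not values from the talk.

```python
from collections import Counter


def volume_increase(post_hours, current_hour, window=3):
    """Ratio of this hour's document count to the mean count of the preceding window.

    post_hours holds one integer hour index per document in the cluster.
    """
    counts = Counter(post_hours)
    previous = [counts.get(h, 0) for h in range(current_hour - window, current_hour)]
    baseline = sum(previous) / window
    # Add 1 to the baseline so a brand-new cluster does not divide by zero.
    return counts.get(current_hour, 0) / (baseline + 1.0)


def is_event_candidate(post_hours, current_hour, threshold=2.0, window=3):
    """Flag a cluster whose volume increase meets the (assumed) threshold."""
    return volume_increase(post_hours, current_hour, window) >= threshold
```

A trained classification model would replace the fixed threshold with a decision learned from several such cluster-level features.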
Social Interaction Features
 Retweets
 RT @username
 Often characterize
Twitter-specific events
 Replies
 Tweet starts with
@username
 Possible indication of
non-event content
 Mentions
 @username anywhere
in the tweet
 Reference to twitter
users that might be
part of an event
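The three patterns above translate fairly directly into regular expressions. The sketch below follows the slide definitions literally, so a retweet or reply also counts as a mention; the pattern details and the function name are assumptions.

```python
import re

RETWEET_RE = re.compile(r"\bRT @\w+")  # "RT @username" anywhere in the text
REPLY_RE = re.compile(r"^@\w+")         # tweet starts with "@username"
MENTION_RE = re.compile(r"@\w+")        # "@username" anywhere in the text


def interaction_flags(text):
    """Return which social interaction patterns a single tweet matches."""
    text = text.strip()
    return {
        "retweet": bool(RETWEET_RE.search(text)),
        "reply": bool(REPLY_RE.match(text)),
        "mention": bool(MENTION_RE.search(text)),
    }
```

Averaging these flags over the tweets in a cluster gives cluster-level percentages that can be used as features.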
Topic Coherence
Intuition: clusters with strong inter-document similarity may contain event information
 Cluster with top terms [Class, Today, Early, Work, Sleep, Start]:
 I’m gonna do my best to go sleep during all my classes today =)
 Starting work early today. Looking fwd to cooking class tonight!
 Today starts the rest of my life…
 Cluster with top terms [Katie, Couric, President, Obama, Interview, CBS]:
 Katie Couric Interview With President Obama http://bit.ly/bRsGPo
 The Katie Couric-President Obama interview has now begun on CBS
 Katie Couric interviews President Obama during CBS' Super Bowl pregame coverage
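One simple way to quantify this intuition is the average pairwise cosine similarity of the tweets in a cluster. The sketch below uses raw term counts instead of TF-IDF weights to stay self-contained; that choice and the function names are assumptions, and the talk does not specify the exact coherence measure.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercased term counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def coherence(texts):
    """Average cosine similarity over all pairs of documents in a cluster."""
    vectors = [bag_of_words(t) for t in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

On the two example clusters above, the interview cluster scores noticeably higher than the everyday-chatter cluster under this measure.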
Trending Behavior
 Trending characteristics of top terms in cluster:
 Exponential fit
 Deviation from expected volume
[Figure: Volume over time for the term “valentine”; x-axis: time (hours), y-axis: documents]
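Both trending characteristics can be approximated with a few lines of arithmetic. The sketch below assumes the exponential fit is a least-squares line through the log of the hourly counts and that deviation is measured as a z-score against past hours; the slides do not state the exact formulation, so both choices and the function names are assumptions.

```python
import math
import statistics


def exponential_growth_rate(counts):
    """Slope b of the least-squares fit log(count) = a + b * t.

    A clearly positive slope suggests exponential growth in volume.
    Hours with zero documents are skipped because log(0) is undefined.
    """
    points = [(t, math.log(c)) for t, c in enumerate(counts) if c > 0]
    if len(points) < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    cov = sum((t - mean_t) * (y - mean_y) for t, y in points)
    return cov / var_t


def deviation_from_expected(history, current):
    """Z-score of the current hourly volume relative to past hourly volumes."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return 0.0 if current == mean else math.copysign(math.inf, current - mean)
    return (current - mean) / std
```

For a term whose hourly volume doubles every hour, the fitted growth rate is ln 2, and a sudden jump well above the historical mean yields a large positive deviation.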
Twitter-Centric Event Features
 Tagging behavior
 Multi-word tags (e.g., #myhomelesssignwouldsay)
 Percentage of tagged tweets
 Top term is a tag
 …
 Retweeting
 Percentage of messages with RT @
 Percentage of messages from top RTed tweet
 …
Event Classifier
 Use features to build a classifier
 Human-annotated training data
 SVM model (selected during training phase)
 Alternative classification modes:
 RW-Event: real-world event vs. rest
 TC-Event: event (real-world or Twitter-centric) vs. non-event
Event Content Selection
Tiger Woods Apology
 Tiger Woods to make a public apology Friday and talk about his future in golf.
 Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx
 Tiger woods y'all,tiger woods y'all,ah tiger woods y'all
 Tiger Woods Hugs: http://tinyurl.com/yhf4uzw
 Wedge wars upstage Watson v Woods: BBC Sport (blog)
Event Content Selection
 Challenges:
 Clusters contain noise
 Relevant tweets might have poor quality text
 Relevant, high quality tweets might not be interesting
 For each tweet and a given event evaluate
 Quality
 Relevance
 Usefulness
Centrality Based Tweet Selection
 Centroid
 Cosine similarity of each tweet to cluster centroid
 Degree
 Tweets are nodes
 Tweets are connected if their similarity is above a
threshold
 Compute degree centrality of each node
 LexRank [Erkan and Radev 2004]
 Same graph structure as Degree method
 Central tweets are similar to other central tweets
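The Centroid and Degree methods above can be sketched as follows, using raw term counts for brevity. The similarity threshold, the vectorization, and the function names are assumptions; LexRank is not shown, since it needs an additional iterative ranking step over the same graph.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Lowercased term counts (a stand-in for the TF-IDF vectors)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid_scores(vectors):
    """Cosine similarity of each tweet to the cluster centroid."""
    total = Counter()
    for v in vectors:
        total.update(v)
    centroid = {t: w / len(vectors) for t, w in total.items()}
    return [cosine(v, centroid) for v in vectors]


def degree_scores(vectors, threshold=0.3):
    """Degree centrality: number of other tweets at or above the similarity threshold."""
    return [
        sum(1 for j, u in enumerate(vectors) if j != i and cosine(v, u) >= threshold)
        for i, v in enumerate(vectors)
    ]


def top_k(tweets, scores, k=5):
    """Tweets ranked by score, highest first."""
    ranked = sorted(range(len(tweets)), key=lambda i: scores[i], reverse=True)
    return [tweets[i] for i in ranked[:k]]
```

Either score list can be passed to `top_k` to pick the tweets shown for an event.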
Experimental Setup: Data
 >2,600,000 tweets, collected via Twitter API
 Location: New York City area
 Indicated on user profile
 Time: February 2010
 First week used to calibrate statistics
 Second week used for training/validation
 Third and fourth weeks used for testing
Experimental Setup: Training
 Data:
 504 clusters
 Fastest growing clusters/hour in second week of February
2010
 Labels:
 Real-world event (e.g., [superbowl,colts,saints,sb44])
 Twitter-specific event (e.g., [uknowubrokewhen,money,job])
 Non-event (e.g., [happy,love,lol])
 Ambiguous cluster (e.g., [south,park,west,sxsw,cartman])
Experimental Setup: Testing
 Baselines:
 Naïve Bayes text classification (NB-Text)
 Fastest-growing clusters per hour
 Classifiers:
 RW-Event
 TC-Event
 400 clusters
 5 hours
 Top 20 clusters per hour according to RW-Event, TC-Event, Fastest-growing, random
Experimental Methodology: Event
Classification
 Classification accuracy
 10-fold cross validation
 Separate test set of randomly chosen tweets
 Event surfacing
 Top events per hour for each technique
 Evaluation:
 Precision@K
 NDCG@K
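For reference, the two ranking metrics can be computed as in the sketch below. It assumes relevance judgments are given in rank order and uses the linear-gain, log2-discount form of DCG; other DCG variants exist, and the slides do not say which one was used.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k ranked items that are relevant (relevance is 0/1 in rank order)."""
    return sum(relevance[:k]) / k if k else 0.0


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

Precision@K treats every position in the top K equally, while NDCG@K rewards placing event clusters nearer the top of the list.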
Identified Events
Description                            Keywords
Senator Evan Bayh's Retirement         bayh, evan, senate, congress, retire
Westminster Dog Show                   westminster, dog, show, club, kennel
Obama’s Meeting with the Dalai Lama    lama, dalai, meet, obama, china
NYC Toy Fair                           toyfairny, starwars, hasbro, lego, toy
Marc Jacobs Fashion Show               jacobs, marc, nyfw, show, fashion
A sample of events identified by our classifiers on the test set
Classification Performance (F-measure)
 RW-Event classifier is more effective at
discriminating between real-world events and rest
of Twitter data
Classifier   Validation   Test
NB-Text      0.785        0.702
RW-Event     0.849        0.837
TC-Event     0.875        0.789
Precision@K Evaluation
[Figure: Precision (0 to 1) vs. Number of Clusters (K = 5, 10, 15, 20) for RW-Event, TC-Event, Fastest, Random]
NDCG@K Evaluation
[Figure: NDCG (0 to 1) vs. Number of Clusters (K = 5, 10, 15, 20) for RW-Event, TC-Event, Fastest, Random]
Experimental Methodology:
Content Selection
 50 event clusters
 Randomly selected from test set
 5 top tweets per event for each: Centroid, Degree, LexRank
 Labeled on a 1-4 scale
 Quality: excellent (4) to poor (1)
 Relevance: clearly relevant (4) to not relevant (1)
 Usefulness: clearly useful (4) to not useful (1)
Selected Tweets: Example
Method     Tweet
Centroid   Video: Tiger regretful; unsure about return to golf - Main Line ...: (AP) Tiger Woods publicly apologized Friday... http://bit.ly/dAO41N
Degree     Watson: Woods needs to show humility upon return (AP): Tom Watson says Tiger Woods needs to "show some humility to... http://bit.ly/cHVH7x
LexRank    RT @EricStangel: Tiger Woods statement: And now for Elin's repsonse....
A sample of tweets selected by different centrality methods
Content Selection Results
 Average scores over all events
 High quality and relevance (>3) for both Degree and Centroid
 Centroid is the only method with high usefulness
Method     Quality   Relevance   Usefulness
LexRank    3.444     2.984       2.608
Degree     3.536     3.156       2.802
Centroid   3.636     3.694       3.474
Preferred Method per Event
 Centroid is the preferred method across all metrics
 For usefulness, Centroid tweets preferred more than 2:1 compared to Degree, 4:1 compared to LexRank
Method     Quality   Relevance   Usefulness
LexRank    22.66%    16.33%      12%
Degree     31.66%    25.33%      28%
Centroid   45.66%    58.33%      60%
Conclusions
Techniques for discovering, organizing, and presenting
social media from real-world events
 Event classifiers
 Important to capture features of Twitter-specific events in
order to reveal the real-world events
 Effectively surfaced real-world events in an unsupervised
setting
 Content selection
 Similarity to centroid technique better at selecting event
content
 There is relevant and useful event content on Twitter!
Learning Similarity Metrics for Event
Identification in Social Media (WSDM ’10)
[Diagram: Ctitle, Ctags, Ctime with weights Wtitle, Wtags, Wtime (learned in a training step) → combine similarities, f(C,W) → final clustering solution]
Identifying Tweets for Known Events
Thank you!
 Pablo Barrio
 David Elson
 Dan Iter
 Yves Petinot
 Sara Rosenthal
 Gonçalo Simões
 Matt Solomon
 Kapil Thadani

 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Dernier (20)

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Surfacing Real-World Event Content on Twitter

  • 1. SURFACING REAL-WORLD EVENT CONTENT ON TWITTER. Hila Becker and Luis Gravano (Columbia University); Mor Naaman (Rutgers University)
  • 2. Event Content in Social Media
  • 3. Event Content in Social Media: smaller events, without traditional news coverage; popular, widely known events
  • 4-6. Event Content in Social Media. Discovery: detecting events using features of social media content (e.g., term statistics); mining content from known event sources (e.g., user-contributed event databases). Organization: associating social media content with events; identifying similar content within and across sites. Presentation: selecting what content to display to a user; providing interfaces that summarize and aggregate the content along different dimensions.
  • 7-8. Identifying Events in Social Media. Timeliness: real-time; retrospective; (prospective). Content discovery: known properties (event databases, e.g., Upcoming, Eventful; keyword triggers, e.g., “earthquake”; shared calendars); unknown properties.
  • 9-13. Identifying Events in Social Media [grid figure; axes: Timeliness (Real-time, Retrospective) and Content Discovery (Known, Unknown)]. Entries placed on the grid: Earthquake prediction using Twitter [Sakaki et al. WWW’10]; Twitter new event detection [Petrović et al. NAACL’10]; Event detection on Flickr [Chen and Roy CIKM’09]; Organization of YouTube concert videos [Kennedy and Naaman WWW’09]
  • 14-18. Identifying Events in Social Media [grid figure; axes: Timeliness (Real-time, Retrospective) and Content Discovery (Known, Unknown)]. Entries placed on the grid: Learning similarity metrics for event identification on Flickr [Becker et al. WSDM’10]; Surfacing events on Twitter; Identifying Twitter content for planned events; Connecting events across sites (e.g., YouTube, Picasa)
  • 19. Twitter Content: streams of textual messages; brief content (140 characters); communicated to network of followers
  • 20-25. Twitter Trending Topics (caption: Twitter trending topics, September 24, 2010, 7:00am): Recurring; Twitter-centric; Confusing; Real-World Events?
  • 26-27. Identifying Events on Twitter. Challenges: wide variety of topics, not all related to events (e.g., morning greetings, “thank you” messages); low-quality text (abbreviations, unconventional language, typos, grammatical errors). Opportunities: content generated in real time as events happen; time and location information.
  • 28-29. Events on Twitter. Types of events on Twitter: exogenous, i.e., real-world occurrences (e.g., Superbowl, “Lost” finale); endogenous, i.e., specific to the Twitter-verse (e.g., #thingsyoushouldntsay meme, RT statement by Lady Gaga). Event: one or more terms and a time period, where the volume of messages posted for the terms in the time period exceeds some expected level of activity.
  • 30-33. Real-Time Unsupervised Event Identification on Twitter. Organization: content representation (text, time, location); group similar content via clustering. Discovery: extract discriminating features of clusters; build an event classifier. Presentation: select content for each event; evaluate the quality, relevance, and usefulness.
  • 34-41. Surfacing Event Content on Twitter [pipeline figure, built up in stages: Tweets; Tweet Clusters; Event Clusters; Selected Tweets]
  • 42-44. Organizing Tweets in Real-Time. Order tweets by post time. Use TF-IDF vector representation of textual content: stop word elimination; stemming; enhanced weight for hashtags (#tag); IDF computed over past data. Separate tweets by location: focus on tweets from NYC; different locations can be processed in parallel.
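The representation described on the "Organizing Tweets in Real-Time" slides can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the smoothed IDF formula, and the `HASHTAG_BOOST` value are assumptions (the slides do not give them), and stemming is omitted to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the slides do not specify one).
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "on", "at", "is", "and", "for"}
# Hypothetical multiplier for hashtag terms; the actual weight is not given in the slides.
HASHTAG_BOOST = 2.0


def tokenize(text):
    """Lowercase the text, keep hashtags as single tokens, drop stop words."""
    tokens = re.findall(r"#?\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_idf(past_tweets):
    """Return an idf(term) function computed over past tweets (smoothed variant)."""
    n = len(past_tweets)
    doc_freq = Counter()
    for tweet in past_tweets:
        doc_freq.update(set(tokenize(tweet)))

    def idf(term):
        # Terms never seen in past data get the highest IDF.
        return math.log((1 + n) / (1 + doc_freq[term])) + 1.0

    return idf


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict, with extra weight on hashtag terms."""
    vector = {}
    for term, tf in Counter(tokenize(text)).items():
        weight = tf * idf(term)
        if term.startswith("#"):
            weight *= HASHTAG_BOOST
        vector[term] = weight
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Computing IDF over past data only, as the slide states, means a newly arriving tweet can be vectorized without revisiting the whole stream.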
  • 45-46. Clustering Algorithm. Many alternatives possible! [Berkhin 2002]. Single-pass incremental clustering algorithm: scalable, online solution; used effectively for event identification in textual news [Allan et al. 1998] and news event detection on Twitter [Sankaranarayanan et al. 2009]; does not require a priori knowledge of number of clusters; known fragmentation issue, often solved with a periodic second pass.
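A single-pass incremental clustering step of the kind named on the "Clustering Algorithm" slides might look like the sketch below. It assumes tweets are already sparse term-weight dicts, compares each tweet to cluster centroids with cosine similarity, and uses a hypothetical `threshold` of 0.5; the slides do not state the similarity function or threshold, and the periodic second pass for fragmentation is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class Cluster:
    """Keeps the running sum of member vectors and the member count."""

    def __init__(self, vector):
        self.total = dict(vector)
        self.size = 1

    def add(self, vector):
        for term, weight in vector.items():
            self.total[term] = self.total.get(term, 0.0) + weight
        self.size += 1


def single_pass_cluster(vectors, threshold=0.5):
    """Assign each vector, in arrival order, to the most similar cluster.

    A new cluster is opened when no existing cluster reaches the threshold,
    so the number of clusters does not need to be known in advance.
    """
    clusters = []
    assignments = []
    for vector in vectors:
        best_index, best_sim = None, 0.0
        for index, cluster in enumerate(clusters):
            # Cosine is scale-invariant, so the vector sum stands in for the mean.
            sim = cosine(vector, cluster.total)
            if sim > best_sim:
                best_index, best_sim = index, sim
        if best_index is not None and best_sim >= threshold:
            clusters[best_index].add(vector)
        else:
            clusters.append(Cluster(vector))
            best_index = len(clusters) - 1
        assignments.append(best_index)
    return clusters, assignments
```

Each tweet is compared against every existing centroid exactly once, which is what makes the approach usable online; the cost is that early assignments are never revisited, hence the fragmentation issue the slide mentions.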
  • 47-49. Overview of Cluster-based Approach. Group similar tweets via online clustering. Compute statistics of cluster content: top terms (e.g., [earthquake, haiti]); number of documents per hour; … Use cluster-level features to identify event clusters: single feature with threshold (e.g., increase in volume over time-window); trained classification model.
  • 50. Real-Time Unsupervised Event Identification on Twitter. Organization: content representation (text, time, location); group similar content via clustering. Discovery: extract discriminating features of clusters; build an event classifier. Presentation: select content for each event; evaluate the quality, relevance, and usefulness.
  • 51-53. Social Interaction Features. Retweets (RT @username): often characterize Twitter-specific events. Replies (tweet starts with @username): possible indication of non-event content. Mentions (@username anywhere in the tweet): reference to Twitter users that might be part of an event.
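The three social interaction signals can be turned into cluster-level percentages along these lines. The regular expressions follow the textual definitions on the slides; the feature names and the choice to report fractions per cluster are assumptions made for illustration.

```python
import re

RETWEET_RE = re.compile(r"\bRT @\w+")   # "RT @username" anywhere in the tweet
REPLY_RE = re.compile(r"^@\w+")         # tweet starts with "@username"
MENTION_RE = re.compile(r"@\w+")        # "@username" anywhere in the tweet


def interaction_features(tweets):
    """Fraction of tweets in a cluster that are retweets, replies, or contain mentions.

    Under the slide's definition ("@username anywhere in the tweet"),
    retweets and replies also count as mentions.
    """
    n = len(tweets)
    if n == 0:
        return {"pct_retweets": 0.0, "pct_replies": 0.0, "pct_mentions": 0.0}
    retweets = sum(1 for t in tweets if RETWEET_RE.search(t))
    replies = sum(1 for t in tweets if REPLY_RE.match(t.strip()))
    mentions = sum(1 for t in tweets if MENTION_RE.search(t))
    return {
        "pct_retweets": retweets / n,
        "pct_replies": replies / n,
        "pct_mentions": mentions / n,
    }
```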
  • 54. Topic Coherence. Intuition: clusters with strong inter-document similarity may contain event information. Example cluster (terms: Class, Today, Early, Work, Sleep, Start): “I’m gonna do my best to go sleep during all my classes today =)”; “Starting work early today. Looking fwd to cooking class tonight!”; “Today starts the rest of my life…”. Example cluster (terms: Katie, Couric, President, Obama, Interview, CBS): “Katie Couric Interview With President Obama http://bit.ly/bRsGPo”; “The Katie Couric-President Obama interview has now begun on CBS”; “Katie Couric interviews President Obama during CBS' Super Bowl pregame coverage”
  • 55. Trending Behavior. Trending characteristics of top terms in cluster: exponential fit; deviation from expected volume. [Figure: volume over time for the term “valentine”; axes: time (hours) and documents]
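The slide names two trending signals without defining them. One plausible reading, offered here only as an assumption, is a standardized deviation of the current hourly count from past counts, and the growth rate of a least-squares exponential fit to recent hourly counts. The function names and the `+1` smoothing inside the logarithm are illustrative choices.

```python
import math
from statistics import mean, pstdev


def volume_deviation(history, current):
    """How far the current hourly count sits from past counts, in standard deviations.

    If the history has zero variance, the raw difference is returned instead.
    """
    mu = mean(history)
    sigma = pstdev(history) or 1.0
    return (current - mu) / sigma


def exponential_growth_rate(counts):
    """Slope b of a least-squares fit log(count + 1) = a + b * hour.

    A positive b suggests exponential growth in volume; +1 guards against log(0).
    """
    if len(counts) < 2:
        return 0.0
    xs = list(range(len(counts)))
    ys = [math.log(c + 1) for c in counts]
    mean_x, mean_y = mean(xs), mean(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator
```

For example, `exponential_growth_rate` applied to hourly counts that roughly double each hour returns a value near `log(2)`, while a flat series returns a value near zero.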
  • 56-57. Twitter-Centric Event Features. Tagging behavior: multi-word tags (e.g., #myhomelesssignwouldsay); percentage of tagged tweets; top term is a tag; … Retweeting: percentage of messages with RT @; percentage of messages from top RTed tweet; …
  • 58. Event Classifier. Use features to build a classifier: human-annotated training data; SVM model (selected during training phase). Alternative classification modes: RW-Event (real-world event vs. rest); TC-Event (event, real-world or Twitter-centric, vs. non-event).
  • 59-61. Real-Time Unsupervised Event Identification on Twitter. Organization: content representation (text, time, location); group similar content via clustering. Discovery: extract discriminating features of clusters; build an event classifier. Presentation: select content for each event; evaluate the quality, relevance, and usefulness.
  • 63-64. Event Content Selection. Example event, Tiger Woods Apology: “Tiger Woods to make a public apology Friday and talk about his future in golf.”; “Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx”; “Tiger woods y'all,tiger woods y'all,ah tiger woods y'all”; “Tiger Woods Hugs: http://tinyurl.com/yhf4 uzw”; “Wedge wars upstage Watson v Woods: BBC Sport (blog)”
  • 65-66. Event Content Selection. Challenges: clusters contain noise; relevant tweets might have poor-quality text; relevant, high-quality tweets might not be interesting. For each tweet and a given event, evaluate: quality; relevance; usefulness.
  • 67-69. Centrality Based Tweet Selection. Centroid: cosine similarity of each tweet to cluster centroid. Degree: tweets are nodes; tweets are connected if their similarity is above a threshold; compute degree centrality of each node. LexRank [Erkan and Radev 2004]: same graph structure as Degree method; central tweets are similar to other central tweets.
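The Centroid and Degree selection methods from these slides can be sketched as below, assuming each tweet in a cluster is already a sparse term-weight dict. The `threshold` default and the helper `top_k` are hypothetical; LexRank is left out of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid_scores(vectors):
    """Centroid method: cosine similarity of each tweet to the cluster centroid."""
    centroid = {}
    for vector in vectors:
        for term, weight in vector.items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(vectors)
    return [cosine(vector, centroid) for vector in vectors]


def degree_scores(vectors, threshold=0.3):
    """Degree method: number of other tweets whose similarity exceeds the threshold."""
    degrees = [0] * len(vectors)
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                degrees[i] += 1
                degrees[j] += 1
    return degrees


def top_k(scores, k):
    """Indices of the k highest-scoring tweets, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Selecting the tweets to display for an event then amounts to calling `top_k(centroid_scores(cluster_vectors), 5)` or the Degree equivalent.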
  • 70-72. Experimental Setup: Data. >2,600,000 tweets, collected via Twitter API. Location: New York City area (indicated on user profile). Time: February 2010; first week used to calibrate statistics; second week used for training/validation; third and fourth weeks used for testing.
  • 73-74. Experimental Setup: Training. Data: 504 clusters; fastest growing clusters/hour in second week of February 2010. Labels: real-world event (e.g., [superbowl,colts,saints,sb44]); Twitter-specific event (e.g., [uknowubrokewhen,money,job]); non-event (e.g., [happy,love,lol]); ambiguous cluster (e.g., [south,park,west,sxsw,cartman]).
  • 75-77. Experimental Setup: Testing. Baselines: Naïve Bayes text classification (NB-Text); fastest-growing clusters per hour. Classifiers: RW-Event; TC-Event. 400 clusters: 5 hours, top 20 clusters per hour according to each of RW-Event, TC-Event, Fastest-growing, and random.
  • 78-79. Experimental Methodology: Event Classification. Classification accuracy: 10-fold cross validation; separate test set of randomly chosen tweets. Event surfacing: top events per hour for each technique; evaluation with Precision@K and NDCG@K.
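Precision@K and NDCG@K are standard ranking measures; a common formulation is sketched below. The slides do not say which gain or discount variant was used, so the linear gain and log2 discount here are assumptions, as is computing the ideal ordering from the same ranked list.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k ranked items that are relevant (relevance > 0)."""
    return sum(1 for r in relevance[:k] if r > 0) / k


def dcg_at_k(relevance, k):
    """Discounted cumulative gain: item at rank i (1-based) contributes r / log2(i + 1)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))


def ndcg_at_k(relevance, k):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same items."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0
```

Here `relevance` is the list of labels for one hour's clusters in ranked order, for example `1` for a real-world event and `0` otherwise. Precision@K ignores where in the top K the events appear, while NDCG@K rewards placing them earlier.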
  • 80. Identified Events. A sample of events identified by our classifiers on the test set:
      Description | Keywords
      Senator Evan Bayh's Retirement | bayh, evan, senate, congress, retire
      Westminster Dog Show | westminster, dog, show, club, kennel
      Obama’s Meeting with the Dalai Lama | lama, dalai, meet, obama, china
      NYC Toy Fair | toyfairny, starwars, hasbro, lego, toy
      Marc Jacobs Fashion Show | jacobs, marc, nyfw, show, fashion
  • 81. Classification Performance (F-measure). The RW-Event classifier is more effective at discriminating between real-world events and the rest of Twitter data.
      Classifier | Validation | Test
      NB-Text | 0.785 | 0.702
      RW-Event | 0.849 | 0.837
      TC-Event | 0.875 | 0.789
  • 82. Precision@K Evaluation [Figure: precision (0 to 1) vs. number of clusters K (5, 10, 15, 20) for RW-Event, TC-Event, Fastest, and Random]
  • 83. NDCG@K Evaluation [Figure: NDCG (0 to 1) vs. number of clusters K (5, 10, 15, 20) for RW-Event, TC-Event, Fastest, and Random]
  • 84. Experimental Methodology: Content Selection. 50 event clusters, randomly selected from test set. 5 top tweets per event for each of Centroid, Degree, and LexRank. Labeled on a 1-4 scale: quality, excellent (4) to poor (1); relevance, clearly relevant (4) to not relevant (1); usefulness, clearly useful (4) to not useful (1).
  • 85. Selected Tweets: Example. A sample of tweets selected by different centrality methods:
      Method | Tweet
      Centroid | Video: Tiger regretful; unsure about return to golf - Main Line ...: (AP) Tiger Woods publicly apologized Friday... http://bit.ly/dAO41N
      Degree | Watson: Woods needs to show humility upon return (AP): Tom Watson says Tiger Woods needs to "show some humility to... http://bit.ly/cHVH7x
      LexRank | RT @EricStangel: Tiger Woods statement: And now for Elin's repsonse....
  • 86. Content Selection Results. Average scores over all events. High quality and relevance (>3) for both Degree and Centroid. Centroid is the only method with high usefulness.
      Method | Quality | Relevance | Usefulness
      LexRank | 3.444 | 2.984 | 2.608
      Degree | 3.536 | 3.156 | 2.802
      Centroid | 3.636 | 3.694 | 3.474
  • 87. Preferred Method per Event. Centroid is the preferred method across all metrics. For usefulness, Centroid tweets are preferred more than 2:1 compared to Degree and 4:1 compared to LexRank.
      Method | Quality | Relevance | Usefulness
      LexRank | 22.66% | 16.33% | 12%
      Degree | 31.66% | 25.33% | 28%
      Centroid | 45.66% | 58.33% | 60%
  • 88. Conclusions. Techniques for discovering, organizing, and presenting social media from real-world events. Event classifiers: important to capture features of Twitter-specific events in order to reveal the real-world events; effectively surfaced real-world events in an unsupervised setting. Content selection: similarity-to-centroid technique better at selecting event content. There is relevant and useful event content on Twitter!
  • 89. Identifying Events in Social Media [grid figure; axes: Timeliness (Real-time, Retrospective) and Content Discovery (Known, Unknown)]. Entries placed on the grid: Learning similarity metrics for event identification on Flickr [Becker et al. WSDM’10]; Surfacing events on Twitter; Identifying Twitter content for planned events; Connecting events across sites (e.g., YouTube, Picasa)
  • 90-92. Learning Similarity Metrics for Event Identification in Social Media (WSDM ’10) [Figure, built up in stages: Ctitle, Ctags, Ctime; weights Wtitle, Wtags, Wtime, learned in a training step; combine similarities via f(C,W); final clustering solution]
  • 93-94. Identifying Tweets for Known Events
  • 95. Identifying Events in Social Media [grid figure; axes: Timeliness (Real-time, Retrospective) and Content Discovery (Known, Unknown)]. Entries placed on the grid: Learning similarity metrics for event identification on Flickr [Becker et al. WSDM’10]; Surfacing events on Twitter; Identifying Twitter content for planned events; Connecting events across sites (e.g., YouTube, Picasa)
  • 96. Thank you! Pablo Barrio, David Elson, Dan Iter, Yves Petinot, Sara Rosenthal, Gonçalo Simões, Matt Solomon, Kapil Thadani