The document discusses mining Twitter data in real-time for trend and information discovery. It describes two works: 1) Classifying emerging trending topics on Twitter with 78.4% accuracy using social features of tweets rather than text. 2) Summarizing live events tweeted on Twitter, such as soccer games, with over 80% precision and recall by detecting sub-events and selecting representative tweets. The outlook discusses further analyzing trend types and evaluating the summarization approach on other event types.
Mining Twitter for Real-Time Trend and Information Discovery
1. Mining Twitter for real-time trend and information discovery
Yahoo! Research Barcelona
Arkaitz Zubiaga
NLP & IR Group @ UNED
December 19th, 2011
2. Motivation
Index
1) Motivation
2) Our Work (I): Classification of Trending Topics
3) Our Work (II): Real-Time Summarization of Events
4) Outlook
Arkaitz Zubiaga (UNED)
Real-time mining of Twitter
December 19th, 2011
2 / 43
3. Motivation
Twitter
Twitter is a microblogging service with over 200 million users.
Users share short messages of up to 140 characters (tweets).
7. Motivation
Increase of activity on Twitter
As of October 2011, Twitter received 250 million tweets per day.
9. Motivation
Usefulness of Twitter
Twitter provides...
1) ...large amounts of data in real-time,
2) from a wide variety of sources,
3) with the ability to spread rapidly.
11. Motivation
Using Twitter for... following events
(1) Live-tweeting about and following events.
12. Motivation
Using Twitter for... helping others
(2) Helping others, as in natural disasters.
13. Motivation
Using Twitter for... finding out about news
and (3) Finding out about breaking news.
14. Motivation
Twitter in the media
Lots of researchers are analyzing tweets.
15. Motivation
Trends on Twitter
The news about the Japan earthquake broke on Twitter.
17. Motivation
Research on Twitter
Most research on Twitter focuses on analyzing streams after the events have happened.
Very little research deals with the real-time analysis of streams.
Our goal: How can we mine Twitter streams to acquire real-time knowledge about events and trends?
18. Our Work (I): Classification of Trending Topics
19. Our Work (I): Classification of Trending Topics
Trending Topics on Twitter
Trending topics reflect the conversations that are being discussed on Twitter more than usual.
20. Our Work (I): Classification of Trending Topics
What produces trending topics?
What kinds of events trigger those trending topics?
21. Our Work (I): Classification of Trending Topics
Typology of Trending Topics
News: Japan earthquake.
Current events: a soccer game.
Memes: funny and viral ideas.
Commemoratives: World AIDS Day.
22. Our Work (I): Classification of Trending Topics
Goal
Find out the type of a trending topic as soon as it emerges.
23. Our Work (I): Classification of Trending Topics
Dataset
1,036 unique trending topics, each with up to 1,500 associated tweets collected as soon as the topic trended.
Manual classification of trending topics:
616 current events.
251 memes.
142 news.
27 commemoratives.
24. Our Work (I): Classification of Trending Topics
Experiment Settings
Support Vector Machines (one-against-all)
500 trends for the training set.
10 runs.
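The settings above can be read as a repeated-split evaluation. The following is a minimal sketch under stated assumptions: the slides do not say how the 500 training trends were drawn, so random sampling per run is assumed, and `majority_trainer` is a hypothetical stand-in occupying the place where a one-against-all SVM would be trained. It is not the authors' code.

```python
import random
from collections import Counter
from statistics import mean

def majority_trainer(train):
    """Placeholder learner: always predicts the most frequent training label.

    A real experiment would train a one-against-all SVM here instead.
    """
    top_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: top_label

def evaluate(dataset, trainer, train_size=500, runs=10, seed=0):
    """Mean accuracy over `runs` random splits with `train_size` training items."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        shuffled = list(dataset)
        rng.shuffle(shuffled)
        train, test = shuffled[:train_size], shuffled[train_size:]
        predict = trainer(train)
        hits = sum(predict(features) == label for features, label in test)
        accuracies.append(hits / len(test))
    return mean(accuracies)

# Label counts from the dataset slide; the feature vectors are left empty
# because the placeholder learner ignores them.
counts = {"current events": 616, "memes": 251, "news": 142, "commemoratives": 27}
dataset = [((), label) for label, n in counts.items() for _ in range(n)]

baseline = evaluate(dataset, majority_trainer)
print(f"majority-class baseline: {baseline:.3f}")
```

Because 616 of the 1,036 topics are current events, this baseline should land close to 0.59. Passing a real trainer to `evaluate` keeps the same protocol of 500 training trends and 10 runs.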
25. Our Work (I): Classification of Trending Topics
Representation of Trending Topics
2 different representation approaches:
Twitter features: 15 straightforward language-independent
features that rely on the social spread of trends.
Bag-of-words: Text of tweets (TF).
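To make the two representations concrete, here is a small sketch. The bag-of-words part computes raw term frequencies over the tweets of one trend. The slides do not list the 15 Twitter features, so the four features in `social_features` are hypothetical examples of language-independent signals, and the tokenizer and sample tweets are assumptions as well.

```python
import re
from collections import Counter

# Assumed tokenizer: words, optionally prefixed by '#' or '@'.
TOKEN = re.compile(r"[#@]?\w+")

def bag_of_words(tweets):
    """Term-frequency (TF) vector of a trend: raw counts over all its tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(token.lower() for token in TOKEN.findall(tweet))
    return counts

def social_features(tweets):
    """Hypothetical language-independent features (NOT the 15 from the talk).

    Expects a non-empty list of tweets.
    """
    n = len(tweets)
    return {
        "retweet_ratio": sum(t.startswith("RT ") for t in tweets) / n,
        "hashtag_ratio": sum("#" in t for t in tweets) / n,
        "link_ratio": sum("http" in t for t in tweets) / n,
        "avg_length": sum(len(t) for t in tweets) / n,
    }

tweets = [
    "RT @fan: GOAL for Uruguay! #CopaAmerica",
    "What a goal! #CopaAmerica",
    "Watch the goal again http://example.com",
]
tf = bag_of_words(tweets)
features = social_features(tweets)
print(tf.most_common(3))
print(features)
```

The bag-of-words vector depends on the vocabulary of the tweets, whereas the ratios in `social_features` can be computed the same way for tweets in any language.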
26. Our Work (I): Classification of Trending Topics
Results
27. Our Work (I): Classification of Trending Topics
Results
28. Our Work (I): Classification of Trending Topics
Main findings
Trending topics can be categorized accurately (78.4%) using social features:
Outperforming use of textual content.
Without making use of external data.
In real-time as the trending topic emerges.
Arkaitz Zubiaga, Damiano Spina, Víctor Fresno, and Raquel Martínez. 2011. Classifying trending topics: a typology of conversation triggers on Twitter. CIKM 2011.
29. Our Work (II): Real-Time Summarization of Events
30. Our Work (II): Real-Time Summarization of Events
Events on Twitter
When users live-tweet about events:
They produce vast amounts of tweets about events.
Users want to follow what others say.
Users cannot follow the overwhelming amounts of tweets.
31. Our Work (II): Real-Time Summarization of Events
Stream summarization
Can we summarize streams of tweets in such a way that:
Users receive a reduced stream that they can follow?
Users do not miss any key sub-event that occurred during the event?
32. Our Work (II): Real-Time Summarization of Events
Study of soccer games
Copa America 2011 (July 1-26, 2011):
26 soccer games.
11k-70k tweets per game.
Tweets are written in 30 languages.
33. Our Work (II): Real-Time Summarization of Events
Gold Standard
Live reports gathered from Yahoo! Sports.
Yahoo! journalists provide annotations for:
Goals.
Penalties.
Red Cards.
Disallowed Goals.
Game Starts, Ends, Stops & Resumptions.
34. Our Work (II): Real-Time Summarization of Events
Histogram of a Soccer Game
[Figure: histogram of the tweet rate (y-axis, 500 to 2500) over time elapsed (x-axis, Unix timestamps from 1310854000 to 1310864000) for one soccer game]
35. Our Work (II): Real-Time Summarization of Events
Summarization of soccer games
2-step summarization:
1) Sub-event detection.
2) Tweet selection.
[Diagram: tweets stream -> Sub-event Detection -> Tweet Selection -> summary, in real-time]
36. Our Work (II): Real-Time Summarization of Events
1st Step: Sub-event Detection
Increase [Zhao et al., 2011]: a sub-event is detected when the tweeting rate increases suddenly (to at least 1.7 times the previous rate).
Outliers: learns from the audience. Tweeting rates that are high compared to the rates seen so far are considered sub-events (above the 90th percentile).
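The two detection rules can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the slides do not give the width of a time slot, the percentile definition (nearest-rank is used here), or a warm-up period before the outlier rule starts firing (`warmup` is a hypothetical parameter), and the tweet rates are made up.

```python
import math

def increase_detector(rates, ratio=1.7):
    """Flag slot i when its rate is at least `ratio` times the previous rate."""
    flagged = []
    for i in range(1, len(rates)):
        if rates[i - 1] > 0 and rates[i] >= ratio * rates[i - 1]:
            flagged.append(i)
    return flagged

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def outlier_detector(rates, q=90, warmup=5):
    """Flag slot i when its rate exceeds the q-th percentile of all earlier rates.

    Only rates seen so far are used, so the rule can run on a live stream.
    """
    flagged = []
    for i in range(warmup, len(rates)):
        if rates[i] > percentile(rates[:i], q):
            flagged.append(i)
    return flagged

# Made-up tweets-per-slot series with a large spike (slot 5) and a smaller one (slot 8).
rates = [100, 110, 105, 120, 115, 400, 130, 125, 230, 120]
print(increase_detector(rates))  # [5, 8]
print(outlier_detector(rates))   # [5]
```

On this toy series the increase rule fires on both spikes, while the outlier rule only fires on the one that stands out against everything seen earlier in the game.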
37. Our Work (II): Real-Time Summarization of Events
1st Step: Results
Approach    P      R      F1     #
Increase    0.29   0.81   0.41   45.4
Outliers    0.51   0.84   0.63   25.6
The increase-based approach detects more sub-events, but with many false positives (it is recall-oriented).
The outlier-based approach, which relies on outstanding tweeting rates instead, improves both P and R.
38. Our Work (II): Real-Time Summarization of Events
2nd Step: Tweet Selection
Each term appearing in tweets in a given timeframe is given a weight
according to:
Frequency (TF).
Language Models (KLD).
These weights make it possible to choose a representative tweet: the tweet with the highest score, computed as the sum of the weights of its terms.
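A minimal sketch of this selection step follows. The TF weighting is a plain term count within the time frame. The exact KLD formulation is not given in the slides, so the code assumes one common reading: each term is weighted by its contribution to the Kullback-Leibler divergence between the current time frame and a background model built from earlier tweets. The whitespace tokenizer, the `smoothing` constant, and the sample tweets are assumptions too.

```python
import math
from collections import Counter

def tokenize(tweet):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return tweet.lower().split()

def tf_weights(tweets):
    """Weight of a term = its raw frequency in the current time frame."""
    return Counter(term for tweet in tweets for term in tokenize(tweet))

def kld_weights(tweets, background, smoothing=1e-6):
    """Weight of a term = its contribution to KL(current || background)."""
    cur, bg = tf_weights(tweets), tf_weights(background)
    cur_total, bg_total = sum(cur.values()), sum(bg.values())
    weights = {}
    for term, count in cur.items():
        p = count / cur_total
        # Smoothing keeps q positive for terms unseen in the background.
        q = (bg[term] + smoothing) / (bg_total + smoothing * len(cur))
        weights[term] = p * math.log(p / q)
    return weights

def select_tweet(tweets, weights):
    """Return the tweet whose terms add up to the highest total weight."""
    return max(tweets, key=lambda t: sum(weights.get(term, 0.0) for term in tokenize(t)))

current = [
    "goal goal uruguay scores",
    "what a goal by uruguay",
    "i am watching the game",
]
background = [
    "i am watching the game",
    "the game is slow",
    "what a boring game",
]
print(select_tweet(current, tf_weights(current)))
print(select_tweet(current, kld_weights(current, background)))
```

Summing weights favors longer tweets. Under the KLD variant, terms that were already frequent before the sub-event (such as "game" here) get low or negative weights, so the selected tweet leans toward what is new in the time frame.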
40. Our Work (II): Real-Time Summarization of Events
Main findings
Use of state-of-the-art text analysis methods generates accurate
summaries:
With precision and recall values above 80% (100% for key
sub-events).
In real-time as the game is being played.
In 3 different languages (es, en, pt).
Without the need for external data.
Damiano Spina, Arkaitz Zubiaga, Enrique Amigó, Julio Gonzalo. Towards Real-Time Summarization of Events from Twitter Streams. To Appear.
41. Outlook
42. Outlook
Outlook
Work 1:
Dig further into each type of trending topic to look for subtypes of trends.
Work 2:
Evaluate the performance of the summarizer on other kinds of scheduled events (award ceremonies, keynote talks, ...).
Evaluate the novelty of the information garnered from tweets.