Peaks and Persistance

•

3 j'aime•536 vues

My CSCW2011 talk on Peaks and Persistence...understanding temporality in microblog event analysis. Full paper here: http://research.yahoo.com/pub/3419

Technologie

Peaks & Persistence
David A. Shamma • Yahoo! Research
Lyndon Kennedy • Yahoo! Labs
Elizabeth F. Churchill • Yahoo! Research

The USA Presidential
A set of tweets.
Inauguration of 2009

Tweet Stream circa
2009
Data Mining Feed
Constant Data Rate
600 Tweets per minute
90 minutes
54,000 Tweets from 1.5 hours
2M mobile streams

Conversational Volume
Inauguration 2009
by Minute

How do we begin to extract what happened and
what people talked about?

“Trending Topics” &
memes are
insufficient
Too general. Highlights
basically what you would want
to stoplist.

M. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. Chi. Eddi: Interactive
Topic-based Browsing of Social Status Streams. ICWSM, 2010.

S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging during two
natural hazards events: what twitter may contribute to situational awareness. In
CHI ’10: Proceedings of the 28th international conference on Human factors in
computing systems, pages 1079–1088, New York, NY, USA, 2010. ACM.

A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding
microblogging usage and communities. In WebKDD/SNA-KDD ’07: Proceedings of
the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social
network analysis, pages 56–65, New York, NY, USA, 2007. ACM.

B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about twitter. In WOSP ’08:
Proceedings of the first workshop on Online social networks, pages 19–24, New
York, NY, USA, 2008. ACM.

traditional tf•idf won’t work: a tweet is a very
small document...too small. The vector is too flat.

“OMG - Obama just messed up the oath -
AWESOME! he’s human!”
Create pseudo-documents from 5-minute slices

No signiﬁcant 
occurrence of a 
term.

Term occurs 
signiﬁcantly

Find terms which are only
windows are 5 minute slices
particular to a window

Find terms which are only
windows are 5 minute slices
particular to a window

No signiﬁcant 
occurrence of 
“aretha”

Find terms which are only
particular to a window

Cluster significant co-
occurring terms per
pseudo document
Aretha
Franklin
Sings
Bow
Hat

Not everything of interest “spikes”.
Not everything of interest is a “trend”.
Not everything of interest is “salient”.

t0 = Time  t2 = Time 
before Max a>er Max

t1 = Time at 
Max

Three slices Do this for every term

$t0 t2 { { t0 = Time before  t2 = Time a>er  peak peak { t_peak = Time at  Max Average score from t2: persistence = t2/t0 Average score from t0$

Sustained/Persisted
Topics persist over time
Interest

0.35

0.25

Terms are not sailent However they persist.

People announce

(12:05) Bastille71: OMG - Obama
just messed up the oath -
AWESOME! he’s human!
(12:07) ryantherobot: LOL Obama
messed up his inaugural oath
twice! regardless, Obama is the
president today! whoooo!
(12:46) mattycus: RT @deelah: it
wasn’t Obama that messed the
oath, it was Chief Justice
Roberts: http://is.gd/gAVo
(12:53) dawngoldberg:
@therichbrooks He flubbed the
oath because Chief Justice
screwed up the order of the
words.

People reply

(12:05) Bastille71: OMG - Obama
just messed up the oath -
AWESOME! he’s human!
(12:07) ryantherobot: LOL Obama
messed up his inaugural oath
twice! regardless, Obama is the
president today! whoooo!
(12:46) mattycus: RT @deelah: it
wasn’t Obama that messed the
oath, it was Chief Justice
Roberts: http://is.gd/gAVo
(12:53) dawngoldberg:
@therichbrooks He flubbed the
oath because Chief Justice
screwed up the order of the
words.

Kanye : blue line
from the MTV VMA
a••hole : red line

Peaky table of contents Again, we find conversation not
and “Kanye” salient

What about complex events? What about human exchanges makes
things iconic?

Recommandé

Price mechanism in the rmg industry of bangladeshAyman Sadiq

Re Branding of OTOBIAyman Sadiq

Telecommunication basicsYoohyun Kim

Top Business Competitions in BangladeshAyman Sadiq

telecommunication-pptsecomps

Things That Don't Matter in Your Presentation!Ayman Sadiq

Tell a Story! [Be the Batman]Ayman Sadiq

Introduction to microsoft word 2007Abdul-rahaman Bin Abubakar Suleman

Recommandé

Price mechanism in the rmg industry of bangladeshAyman Sadiq

Re Branding of OTOBIAyman Sadiq

Telecommunication basicsYoohyun Kim

Top Business Competitions in BangladeshAyman Sadiq

telecommunication-pptsecomps

Things That Don't Matter in Your Presentation!Ayman Sadiq

Tell a Story! [Be the Batman]Ayman Sadiq

Introduction to microsoft word 2007Abdul-rahaman Bin Abubakar Suleman

Recommending Actions, Not ContentDavid Shamma

Staying together: Understanding People and Media in Synchronous Connected Sy...David Shamma

Human to Dancer Interaction: Designing for Embodied Performances in a Partici...David Shamma

SFUSGSDavid Shamma

A Movie And A ChatDavid Shamma

Tweet the DebatesDavid Shamma

Synergies Between Search and Social MetricsDavid Shamma

Spinning OnlineDavid Shamma

Video Creativity beyond DisseminationDavid Shamma

Watch What I WatchDavid Shamma

Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix

Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software

Pigging Solutions in Pet Food ManufacturingPigging Solutions

Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko

How to Remove Document Management Hurdles with X-Docs?XfilesPro

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent

Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard

Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik

The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los

How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes

08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls

Salesforce Community Group Quito, Salesforce 101Paola De la Torre

Contenu connexe

Plus de David Shamma

Recommending Actions, Not ContentDavid Shamma

Staying together: Understanding People and Media in Synchronous Connected Sy...David Shamma

Human to Dancer Interaction: Designing for Embodied Performances in a Partici...David Shamma

SFUSGSDavid Shamma

A Movie And A ChatDavid Shamma

Tweet the DebatesDavid Shamma

Synergies Between Search and Social MetricsDavid Shamma

Spinning OnlineDavid Shamma

Video Creativity beyond DisseminationDavid Shamma

Watch What I WatchDavid Shamma

Plus de David Shamma (10)

Recommending Actions, Not Content

Staying together: Understanding People and Media in Synchronous Connected Sy...

Human to Dancer Interaction: Designing for Embodied Performances in a Partici...

SFUSGS

A Movie And A Chat

Tweet the Debates

Synergies Between Search and Social Metrics

Spinning Online

Video Creativity beyond Dissemination

Watch What I Watch

Dernier

Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix

Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software

Pigging Solutions in Pet Food ManufacturingPigging Solutions

Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko

How to Remove Document Management Hurdles with X-Docs?XfilesPro

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent

Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard

Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik

The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los

How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes

08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls

Salesforce Community Group Quito, Salesforce 101Paola De la Torre

IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge

08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls

From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software

How to convert PDF to text with Nanonetsnaman860154

Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4

Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC

Scaling API-first – The story of a global engineering organizationRadu Cotescu

AI as an Interface for Commercial BuildingsMemoori

Dernier (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation

Pigging Solutions in Pet Food Manufacturing

Handwritten Text Recognition for manuscripts and early printed texts

How to Remove Document Management Hurdles with X-Docs?

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...

Maximizing Board Effectiveness 2024 Webinar.pptx

Injustice - Developers Among Us (SciFiDevCon 2024)

The 7 Things I Know About Cyber Security After 25 Years | April 2024

How to Troubleshoot Apps for the Modern Connected Worker

08448380779 Call Girls In Greater Kailash - I Women Seeking Men

Salesforce Community Group Quito, Salesforce 101

IAC 2024 - IA Fast Track to Search Focused AI Solutions

08448380779 Call Girls In Civil Lines Women Seeking Men

From Event to Action: Accelerate Your Decision Making with Real-Time Automation

How to convert PDF to text with Nanonets

Azure Monitor & Application Insight to monitor Infrastructure & Application

Breaking the Kubernetes Kill Chain: Host Path Mount

Scaling API-first – The story of a global engineering organization

AI as an Interface for Commercial Buildings

Peaks and Persistance

1. Peaks & Persistence David A. Shamma • Yahoo! Research Lyndon Kennedy • Yahoo! Labs Elizabeth F. Churchill • Yahoo! Research

2. The USA Presidential A set of tweets. Inauguration of 2009

3. Tweet Stream circa 2009 Data Mining Feed Constant Data Rate 600 Tweets per minute 90 minutes 54,000 Tweets from 1.5 hours 2M mobile streams

4. Volume by minute Inauguration 2009

5. Conversational Volume Inauguration 2009 by Minute

6. How do we begin to extract what happened and what people talked about?

7. “Trending Topics” & memes are insufficient Too general. Highlights basically what you would want to stoplist.

8. M. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. Chi. Eddi: Interactive Topic-based Browsing of Social Status Streams. ICWSM, 2010. S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging during two natural hazards events: what twitter may contribute to situational awareness. In CHI ’10: Proceedings of the 28th international conference on Human factors in computing systems, pages 1079–1088, New York, NY, USA, 2010. ACM. A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In WebKDD/SNA-KDD ’07: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 56–65, New York, NY, USA, 2007. ACM. B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about twitter. In WOSP ’08: Proceedings of the first workshop on Online social networks, pages 19–24, New York, NY, USA, 2008. ACM.

9. traditional tf•idf won’t work: a tweet is a very small document...too small. The vector is too flat. “OMG - Obama just messed up the oath - AWESOME! he’s human!” Create pseudo-documents from 5-minute slices

10. No signiﬁcant  occurrence of a  term. Term occurs  signiﬁcantly Find terms which are only windows are 5 minute slices particular to a window

11. Find terms which are only windows are 5 minute slices particular to a window

12. No signiﬁcant  occurrence of  “aretha” Find terms which are only particular to a window

13. Cluster significant co- occurring terms per pseudo document Aretha Franklin Sings Bow Hat

14. Not everything of interest “spikes”. Not everything of interest is a “trend”. Not everything of interest is “salient”.

15. t0 = Time  t2 = Time  before Max a>er Max t1 = Time at  Max Three slices Do this for every term

16. t0 t2 { { t0 = Time before  t2 = Time a>er  peak peak { t_peak = Time at  Max Average score from t2: persistence = t2/t0 Average score from t0

17. Sustained/Persisted Topics persist over time Interest

18. 0.35 0.25 Terms are not sailent However they persist.

19. People announce (12:05) Bastille71: OMG - Obama just messed up the oath - AWESOME! he’s human! (12:07) ryantherobot: LOL Obama messed up his inaugural oath twice! regardless, Obama is the president today! whoooo! (12:46) mattycus: RT @deelah: it wasn’t Obama that messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo (12:53) dawngoldberg: @therichbrooks He flubbed the oath because Chief Justice screwed up the order of the words.

20. People reply (12:05) Bastille71: OMG - Obama just messed up the oath - AWESOME! he’s human! (12:07) ryantherobot: LOL Obama messed up his inaugural oath twice! regardless, Obama is the president today! whoooo! (12:46) mattycus: RT @deelah: it wasn’t Obama that messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo (12:53) dawngoldberg: @therichbrooks He flubbed the oath because Chief Justice screwed up the order of the words.

21. 2009 MTV VMA 2.4 million tweets

22. Persisted terms from the MTV VMA

23. Kanye : blue line from the MTV VMA a••hole : red line

24. Peaky table of contents Again, we find conversation not and “Kanye” salient

25. Peaky table of contents Again, we find conversation not and “Kanye” salient

26. What about complex events? What about human exchanges makes things iconic?

27. Thanks! Extra thanks to my co-authors!

Notes de l'éditeur

And now for the quant method part of the show. Spend more time setting the context for the data. Story and cultural background is important here.\n
Emphasize what this is as a broadcast media event. And its a big deal.\n
\n
\n
\n
\n
There is a need to look beyond memes and trends. These are actually good for pulling out over-arching themes, but poor at describing the event.\n
\n
So for every 5 minutes, we ignore who tweeted what...just pool it all together.\n
\n
\n
\n
\n
Not everything &#x201C;trends&#x201D; not everything is a &#x201C;spike&#x201D;\n
\n
\n
\n
\n
\n
\n
Explain this too. What is this...\n
\n
\n
\n
Its important over time, even if its not a trend or if it trends momentarily.\n\n\n
We&#x2019;ve taken this through mechanics and techniques; this augments SNA. We are pushing the bounds of SNA in regards to media and event. The importance of an event needs to be measured over time, not just in the right here right now. We know the revolution will be something to remember. Its what happened in there that we need to pull out.\n
thanks\n\nReviewer 3 raises some concerns about this proposed variation of TFIDF with particular attention to why the TF metric is defined at the document level. We count the number of tweets that a term appears in (rather than raw total frequency of the term) since we are focused on gauging the number of *people* using a term at a given time. A raw count of occurrence can be biased by a single user repeating a term many times in one tweet, which we did observe. R3 also noticed that our TFIDF variant is sensitive to very infrequent terms (which is also the case with traditional TFIDF). We also observed this problem and handled it by treating terms with frequencies below a certain threshold as stop words.\n