SlideShare une entreprise Scribd logo
1  sur  27
Peaks & Persistence
David A. Shamma • Yahoo! Research
Lyndon Kennedy • Yahoo! Labs
Elizabeth F. Churchill • Yahoo! Research
The USA Presidential
                       A set of tweets.
Inauguration of 2009
Tweet Stream circa
2009
Data Mining Feed
Constant Data Rate
600 Tweets per minute
90 minutes
54,000 Tweets from 1.5 hours
2M mobile streams
Volume by minute   Inauguration 2009
Conversational Volume
                        Inauguration 2009
            by Minute
How do we begin to extract what happened and
what people talked about?
“Trending Topics” &
memes are
insufficient
Too general. Highlights
basically what you would want
to stoplist.
M. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. Chi. Eddi: Interactive
Topic-based Browsing of Social Status Streams. ICWSM, 2010.

S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging during two
natural hazards events: what twitter may contribute to situational awareness. In
CHI ’10: Proceedings of the 28th international conference on Human factors in
computing systems, pages 1079–1088, New York, NY, USA, 2010. ACM.



A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding
microblogging usage and communities. In WebKDD/SNA-KDD ’07: Proceedings of
the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social
network analysis, pages 56–65, New York, NY, USA, 2007. ACM.

B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about twitter. In WOSP ’08:
Proceedings of the first workshop on Online social networks, pages 19–24, New
York, NY, USA, 2008. ACM.
traditional tf•idf won’t work: a tweet is a very
small document...too small. The vector is too flat.

“OMG - Obama just messed up the oath -
 AWESOME! he’s human!”
Create pseudo-documents from 5-minute slices
No
significant

occurrence
of
a

          term.




   Term
occurs

   significantly




         Find terms which are only
                                     windows are 5 minute slices
            particular to a window
Find terms which are only
                            windows are 5 minute slices
   particular to a window
No
significant

occurrence
of

     “aretha”




       Find terms which are only
          particular to a window
Cluster significant co-
occurring terms per
pseudo document
Aretha
Franklin
Sings
Bow
Hat
Not everything of interest “spikes”.
Not everything of interest is a “trend”.
Not everything of interest is “salient”.
t0
=
Time
                                  t2
=
Time

before
Max                                   a>er
Max




              t1
=
Time
at

                  Max


              Three slices    Do this for every term
t0                      t2
               {




                                           {
         t0
=
Time
before
              t2
=
Time
a>er

               peak                          peak
                             {

                    t_peak
=
Time
at

                         Max



Average score from t2:
                                           persistence = t2/t0
Average score from t0
Sustained/Persisted
                      Topics persist over time
           Interest
0.35

                             0.25




Terms are not sailent   However they persist.
People announce

(12:05) Bastille71: OMG - Obama
   just messed up the oath -
   AWESOME! he’s human!
(12:07) ryantherobot: LOL Obama
   messed up his inaugural oath
   twice! regardless, Obama is the
   president today! whoooo!
(12:46) mattycus: RT @deelah: it
   wasn’t Obama that messed the
   oath, it was Chief Justice
   Roberts: http://is.gd/gAVo
(12:53) dawngoldberg:
   @therichbrooks He flubbed the
   oath because Chief Justice
   screwed up the order of the
   words.
People reply

(12:05) Bastille71: OMG - Obama
   just messed up the oath -
   AWESOME! he’s human!
(12:07) ryantherobot: LOL Obama
   messed up his inaugural oath
   twice! regardless, Obama is the
   president today! whoooo!
(12:46) mattycus: RT @deelah: it
   wasn’t Obama that messed the
   oath, it was Chief Justice
   Roberts: http://is.gd/gAVo
(12:53) dawngoldberg:
   @therichbrooks He flubbed the
   oath because Chief Justice
   screwed up the order of the
   words.
2009 MTV VMA   2.4 million tweets
Persisted terms   from the MTV VMA
Kanye : blue line
                     from the MTV VMA
a••hole : red line
Peaky table of contents   Again, we find conversation not
           and “Kanye”    salient
Peaky table of contents   Again, we find conversation not
           and “Kanye”    salient
What about complex events? What about human exchanges makes
things iconic?
Thanks! Extra thanks to my co-authors!

Contenu connexe

Plus de David Shamma

Recommending Actions, Not Content
Recommending Actions, Not ContentRecommending Actions, Not Content
Recommending Actions, Not ContentDavid Shamma
 
Staying together: Understanding People and Media in Synchronous Connected Sy...
Staying together:  Understanding People and Media in Synchronous Connected Sy...Staying together:  Understanding People and Media in Synchronous Connected Sy...
Staying together: Understanding People and Media in Synchronous Connected Sy...David Shamma
 
Human to Dancer Interaction: Designing for Embodied Performances in a Partici...
Human to Dancer Interaction: Designing for Embodied Performances in a Partici...Human to Dancer Interaction: Designing for Embodied Performances in a Partici...
Human to Dancer Interaction: Designing for Embodied Performances in a Partici...David Shamma
 
A Movie And A Chat
A Movie And A ChatA Movie And A Chat
A Movie And A ChatDavid Shamma
 
Synergies Between Search and Social Metrics
Synergies Between Search and Social MetricsSynergies Between Search and Social Metrics
Synergies Between Search and Social MetricsDavid Shamma
 
Video Creativity beyond Dissemination
Video Creativity beyond DisseminationVideo Creativity beyond Dissemination
Video Creativity beyond DisseminationDavid Shamma
 
Watch What I Watch
Watch  What  I  WatchWatch  What  I  Watch
Watch What I WatchDavid Shamma
 

Plus de David Shamma (10)

Recommending Actions, Not Content
Recommending Actions, Not ContentRecommending Actions, Not Content
Recommending Actions, Not Content
 
Staying together: Understanding People and Media in Synchronous Connected Sy...
Staying together:  Understanding People and Media in Synchronous Connected Sy...Staying together:  Understanding People and Media in Synchronous Connected Sy...
Staying together: Understanding People and Media in Synchronous Connected Sy...
 
Human to Dancer Interaction: Designing for Embodied Performances in a Partici...
Human to Dancer Interaction: Designing for Embodied Performances in a Partici...Human to Dancer Interaction: Designing for Embodied Performances in a Partici...
Human to Dancer Interaction: Designing for Embodied Performances in a Partici...
 
SFUSGS
SFUSGSSFUSGS
SFUSGS
 
A Movie And A Chat
A Movie And A ChatA Movie And A Chat
A Movie And A Chat
 
Tweet the Debates
Tweet the DebatesTweet the Debates
Tweet the Debates
 
Synergies Between Search and Social Metrics
Synergies Between Search and Social MetricsSynergies Between Search and Social Metrics
Synergies Between Search and Social Metrics
 
Spinning Online
Spinning OnlineSpinning Online
Spinning Online
 
Video Creativity beyond Dissemination
Video Creativity beyond DisseminationVideo Creativity beyond Dissemination
Video Creativity beyond Dissemination
 
Watch What I Watch
Watch  What  I  WatchWatch  What  I  Watch
Watch What I Watch
 

Dernier

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Dernier (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

Peaks and Persistance

  • 1. Peaks & Persistence David A. Shamma • Yahoo! Research Lyndon Kennedy • Yahoo! Labs Elizabeth F. Churchill • Yahoo! Research
  • 2. The USA Presidential A set of tweets. Inauguration of 2009
  • 3. Tweet Stream circa 2009 Data Mining Feed Constant Data Rate 600 Tweets per minute 90 minutes 54,000 Tweets from 1.5 hours 2M mobile streams
  • 4. Volume by minute Inauguration 2009
  • 5. Conversational Volume Inauguration 2009 by Minute
  • 6. How do we begin to extract what happened and what people talked about?
  • 7. “Trending Topics” & memes are insufficient Too general. Highlights basically what you would want to stoplist.
  • 8. M. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. Chi. Eddi: Interactive Topic-based Browsing of Social Status Streams. ICWSM, 2010. S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging during two natural hazards events: what twitter may contribute to situational awareness. In CHI ’10: Proceedings of the 28th international conference on Human factors in computing systems, pages 1079–1088, New York, NY, USA, 2010. ACM. A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In WebKDD/SNA-KDD ’07: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 56–65, New York, NY, USA, 2007. ACM. B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about twitter. In WOSP ’08: Proceedings of the first workshop on Online social networks, pages 19–24, New York, NY, USA, 2008. ACM.
  • 9. traditional tf•idf won’t work: a tweet is a very small document...too small. The vector is too flat. “OMG - Obama just messed up the oath - AWESOME! he’s human!” Create pseudo-documents from 5-minute slices
  • 10. No
significant
 occurrence
of
a
 term. Term
occurs
 significantly Find terms which are only windows are 5 minute slices particular to a window
  • 11. Find terms which are only windows are 5 minute slices particular to a window
  • 12. No
significant
 occurrence
of
 “aretha” Find terms which are only particular to a window
  • 13. Cluster significant co- occurring terms per pseudo document Aretha Franklin Sings Bow Hat
  • 14. Not everything of interest “spikes”. Not everything of interest is a “trend”. Not everything of interest is “salient”.
  • 15. t0
=
Time
 t2
=
Time
 before
Max a>er
Max t1
=
Time
at
 Max Three slices Do this for every term
  • 16. t0 t2 { { t0
=
Time
before
 t2
=
Time
a>er
 peak peak { t_peak
=
Time
at
 Max Average score from t2: persistence = t2/t0 Average score from t0
  • 17. Sustained/Persisted Topics persist over time Interest
  • 18. 0.35 0.25 Terms are not sailent However they persist.
  • 19. People announce (12:05) Bastille71: OMG - Obama just messed up the oath - AWESOME! he’s human! (12:07) ryantherobot: LOL Obama messed up his inaugural oath twice! regardless, Obama is the president today! whoooo! (12:46) mattycus: RT @deelah: it wasn’t Obama that messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo (12:53) dawngoldberg: @therichbrooks He flubbed the oath because Chief Justice screwed up the order of the words.
  • 20. People reply (12:05) Bastille71: OMG - Obama just messed up the oath - AWESOME! he’s human! (12:07) ryantherobot: LOL Obama messed up his inaugural oath twice! regardless, Obama is the president today! whoooo! (12:46) mattycus: RT @deelah: it wasn’t Obama that messed the oath, it was Chief Justice Roberts: http://is.gd/gAVo (12:53) dawngoldberg: @therichbrooks He flubbed the oath because Chief Justice screwed up the order of the words.
  • 21. 2009 MTV VMA 2.4 million tweets
  • 22. Persisted terms from the MTV VMA
  • 23. Kanye : blue line from the MTV VMA a••hole : red line
  • 24. Peaky table of contents Again, we find conversation not and “Kanye” salient
  • 25. Peaky table of contents Again, we find conversation not and “Kanye” salient
  • 26. What about complex events? What about human exchanges makes things iconic?
  • 27. Thanks! Extra thanks to my co-authors!

Notes de l'éditeur

  1. And now for the quant method part of the show. Spend more time setting the context for the data. Story and cultural background is important here.\n
  2. Emphasize what this is as a broadcast media event. And its a big deal.\n
  3. \n
  4. \n
  5. \n
  6. \n
  7. There is a need to look beyond memes and trends. These are actually good for pulling out over-arching themes, but poor at describing the event.\n
  8. \n
  9. So for every 5 minutes, we ignore who tweeted what...just pool it all together.\n
  10. \n
  11. \n
  12. \n
  13. \n
  14. Not everything “trends” not everything is a “spike”\n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. Explain this too. What is this...\n
  22. \n
  23. \n
  24. \n
  25. Its important over time, even if its not a trend or if it trends momentarily.\n\n\n
  26. We’ve taken this through mechanics and techniques; this augments SNA. We are pushing the bounds of SNA in regards to media and event. The importance of an event needs to be measured over time, not just in the right here right now. We know the revolution will be something to remember. Its what happened in there that we need to pull out.\n
  27. thanks\n\nReviewer 3 raises some concerns about this proposed variation of TFIDF with particular attention to why the TF metric is defined at the document level. We count the number of tweets that a term appears in (rather than raw total frequency of the term) since we are focused on gauging the number of *people* using a term at a given time. A raw count of occurrence can be biased by a single user repeating a term many times in one tweet, which we did observe. R3 also noticed that our TFIDF variant is sensitive to very infrequent terms (which is also the case with traditional TFIDF). We also observed this problem and handled it by treating terms with frequencies below a certain threshold as stop words.\n