Digital Paleontology - Digging for Ancient Tweets

  1. Digital Paleontology: Digging for Ancient Tweets
     Martin Lafréchoux. JITSO 2012 – EPFL, December 4th, 2012.
     Full text of the research notes is available on the JITSO website and on my research blog.
     Pic: www.flickr.com/photos/mag3737/307016400/
  2. Hi. My name is Martin Lafréchoux. I am a PhD student at Paris Ouest Nanterre. My dissertation deals with the web page as a document.
     I’ll try not to repeat the content of the research notes published on the website. I’d rather address some points that were left out due to space constraints or because they were not ready for publication.
     Mostly I’ll talk about the various twitter APIs and their implications for the researcher.
     But first I’d like to try and explain the title of the presentation. Digital Paleontology: why?
  3. Twitter is live. You watch twitter as you would TV, to see what’s happening right now.
     When a piece of information spreads on twitter, it creates a cascade.
     Cascades are live data, best observed in the wild; that’s the whole point of just-in-time sociology.
     The cascade goes on for a short while, then it disappears.
     Twitter offers several ways to observe these cascades. We will go through them briefly.
     It’s a bit hard to represent something this vivid in a slide, so I hope you can pardon me for using shaky visual metaphors.
  4. Basic use: visiting twitter.com
     The simplest way to observe information cascades is in the wild, using the twitter web client.
     You’re in the middle of the action, but that’s not always the best place to see the big picture.
     If that’s not precise enough, twitter offers several APIs with distinct characteristics.
     Pic: http://www.public-domain-image.com/
  5. Real Time: the Streaming API
     The most complete API is called the Streaming API. Complete access to all posted tweets is called the firehose.
     Pic: http://www.flickr.com/photos/usnavy/5887790560/
  6. Access is pretty limited, both expensive and exclusive.
     Gaining access to the firehose is one thing. Then you have to have the stomach to drink from it.
  7. October 2012 average: 500 000 000 / day
     A complete recording for a month of twitter in January 2011 is about 1 billion tweets (Myers & Leskovec, 2012); that’s about 400 unique tweets per second on average, not counting RTs.
     That’s for an average day, almost two years ago.
     In October, twitter’s CEO reported 500 million tweets per day: http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/
     That’s about 5800 per second.
     Pic: http://www.flickr.com/photos/chrish_99/7431798496/
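The per-second rates quoted on slide 7 can be checked with a few lines of arithmetic. This is a minimal sketch: the two totals come from the notes, while the 31-day month length and the function name are assumptions made for the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tweets_per_second(total_tweets: float, days: float) -> float:
    """Average tweet rate over a period lasting `days` days."""
    return total_tweets / (days * SECONDS_PER_DAY)


# January 2011: about 1 billion tweets in one month (figure quoted in the notes;
# the 31-day length is an assumption for this sketch).
jan_2011_rate = tweets_per_second(1_000_000_000, 31)

# October 2012: about 500 million tweets per day (figure quoted in the notes).
oct_2012_rate = tweets_per_second(500_000_000, 1)

print(f"January 2011: ~{jan_2011_rate:.0f} tweets/s")
print(f"October 2012: ~{oct_2012_rate:.0f} tweets/s")
print(f"Growth factor: ~{oct_2012_rate / jan_2011_rate:.1f}x")
```

Both results land close to the rounded figures given in the notes (about 400 and about 5800 per second).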
  8. There is not much you can do with such an astonishing stream of data besides displaying it, as twitter does, or storing it to analyze it later. Even that requires huge resources.
     Pic: http://www.flickr.com/photos/thomashawk/5683179189/
  9. Digital Taxidermy?
     Once recorded, the live data is frozen. The cascades are turned into something like a stuffed animal. Digital taxidermy, if you will.
     Damien Hirst, The Physical Impossibility of Death in the Mind of Someone Living
     Photo: http://www.flickr.com/photos/chaostrophy/2594401926/
  10. Just-in-time: the Search API
      The next best option is to use the Search API to analyze tweets in real time, as they are published.
      After real-time recording, you get just-in-time. For six to nine days, you can use the twitter Search API.
      The Search API would be like an underwater tunnel, giving you easy access to the closest data.
      Pic: http://www.flickr.com/photos/lorensztajer/4201751064/
  11. A bit too late: the REST API
      But what happens after that, after real-time? When you are just-too-late?
      After a week or so, you only have the REST API available.
      It means you cannot search, and you only get 150 API calls an hour.
      You can view the details of a tweet or a user if you already know where to look.
      As the twitter dev docs put it, you need to get “creative” to get what you want with such limitations.
      Pic: http://www.flickr.com/photos/stinkenroboter/6604532503/
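To get a feel for what 150 calls an hour means in practice, here is a rough planning sketch. It assumes one call retrieves one already-known tweet or user, which is a simplification; the function names and the placeholder IDs are hypothetical, and no real API is called.

```python
import math

CALLS_PER_HOUR = 150  # limit quoted in the notes


def hourly_windows_needed(n_lookups: int, calls_per_hour: int = CALLS_PER_HOUR) -> int:
    """Number of one-hour windows needed, assuming one lookup per API call."""
    if n_lookups < 0:
        raise ValueError("n_lookups must be non-negative")
    return math.ceil(n_lookups / calls_per_hour)


def batches(ids, size=CALLS_PER_HOUR):
    """Split a list of already-known IDs into batches that fit one hourly window."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


known_ids = list(range(1, 401))  # 400 placeholder IDs, not real tweet IDs
print(hourly_windows_needed(len(known_ids)))
print([len(batch) for batch in batches(known_ids)])
```

Under that assumption, a few thousand lookups already take most of a day, which is why searching is out of reach once the Search API window has closed.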
  12. Much too late... Digital Paleontology
      You don’t have access to the real, live cascade anymore, but you can still see its shadow, its ghost.
      And if you spend enough time putting the scattered pieces back together, it can give you a good idea of what the cascade was.
      So what happens when you are too late for just-in-time amounts, to me, to digital paleontology.
      There are bits and pieces of data scattered all over the web, if you only know where to dig.
      For my PhD, I am trying to work a bit of digital paleontology to reconstruct an information cascade that happened more than a year and a half ago, on May 2, 2011.
      As you may have guessed, I originally tried to map the cascade about 15 days after the fact using the twitter API, and came back sorely disappointed. I tried again six months later, and this presentation is the result of my work.
      Pic: http://www.flickr.com/photos/14508691@N08/4531324072/
  13. May 01, 2011 – 03:58PM ET
      On twitter, it began like this. A Pakistani can’t sleep because of a helicopter.
      (Timestamps are crucial. I have converted everything to Eastern Time for clarity.)
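The conversion to Eastern Time mentioned on slide 13 might look like the sketch below. It assumes the source timestamps are UTC strings in ISO 8601 form and uses a fixed UTC-4 offset, which is valid for early May 2011 (daylight time) but not year-round. The input values are illustrative, chosen to land on times shown in the slides, and are not taken from tweet metadata.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-4 offset: US Eastern Daylight Time, in effect in early May 2011.
# Assumption for this sketch; a general tool would use a real time zone database.
EDT = timezone(timedelta(hours=-4), name="EDT")


def to_eastern(utc_string: str) -> str:
    """Parse a UTC timestamp such as '2011-05-01T19:58:00Z' and format it like the slides."""
    utc_dt = datetime.strptime(utc_string, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    local = utc_dt.astimezone(EDT)
    return f"{local:%b %d, %Y} – {local:%I:%M%p} ET"


print(to_eastern("2011-05-01T19:58:00Z"))  # illustrative input
print(to_eastern("2011-05-02T02:24:00Z"))  # illustrative input that crosses midnight UTC
```

Note that the second input is dated May 2 in UTC but falls on the evening of May 1 in Eastern Time, which is exactly the kind of off-by-one-day error a single reference time zone avoids.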
  14. May 01, 2011 – 10:24PM ET
      Six and a half hours later, wrestler and entertainer Dwayne ‘The Rock’ Johnson posts a rather cryptic tweet.
  15. May 01, 2011 – 10:24PM ET
      At the same exact minute, Donald Rumsfeld’s former chief of staff posts a more explicit one.
  16. People begin gathering in front of the White House, waiting for the press conference.
      Pic: http://www.flickr.com/photos/theqspeaks/5679548043/
  17. May 01, 2011 – 11:35PM ET
      An hour later, Barack Obama appeared on TV to announce the news.
      Responses were mixed.
      Pic: http://www.flickr.com/photos/us_embassy_newzealand/5682145416/
  18. Some Americans were very vocal in their enthusiasm.
      Pic: http://www.flickr.com/photos/zokuga/5678699597/
  19. “I have never wished a man dead, but I have read some obituaries with great pleasure.” Mark Twain
      Somewhat less enthusiastic Americans expressed their feelings by tweeting this quote by Mark Twain...
      ... which is actually by civil rights lawyer Clarence Darrow.
      Pic: http://en.wikipedia.org/wiki/File:Mark_Twain,_Brady-Handy_photo_portrait,_Feb_7,_1871,_cropped.jpg
  20. “I have never wished a man dead, but I have read some obituaries with great pleasure.” Clarence Darrow
  21. It seems that Mark Twain is to the US as Winston Churchill is to the UK or Jules Renard to France: funny quotes are attributed to him by default.
      The mistake probably comes down to this page. You can imagine how this would look as a Google snippet.
      http://www.estatevaults.com/lm/archives/2006/08/16/twain_and_darro.html
  22. “I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.” Martin Luther King, Jr.
      Some Americans were appalled that bin Laden was killed rather than taken into custody.
      Many chose to tweet this quote by Martin Luther King to express their feelings.
      As you may have guessed, Martin Luther King never said or wrote this sentence.
      And this is what I will be talking about today, after a rather lengthy introduction.
      Now, how did this happen? How did this misattributed quote go viral?
      That’s a job for a digital paleontologist.
      Pic: http://en.wikipedia.org/wiki/File:Martin_Luther_King_Jr_NYWTS.jpg
  23. “I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.”
  24. “I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.” ?
  25. The cascade
      We don’t often get to see the starting point of an information epidemic, but in this case it is known.
      Pic: http://www.flickr.com/photos/vilseskogen/3279138165/
  26. It all started from this facebook post.
      Jessica Dovey, who teaches English in Kobe, Japan, posts the following message on her Facebook wall.
  27. May 02, 2011 – 12:15PM ET
      It’s interesting because:
      - we don’t always get access to the starting point of a cascade
      - we don’t often get to see Facebook at all.
      I was prepared to say that we have no idea what happened exactly, but I’ll make a guess.
  28. All that we know is that at some point, someone stripped the quote of anything that was actually written by MLK and presented him as the author.
      Here is what (to the best of my knowledge) happened.
  30. 02:22PM ET
      A user posted the whole quote to twitlonger, mistakenly attributing all of it to MLK.
      Quite a few tweets point to this page (it says 80; Topsy recorded a dozen or so).
  31. Many of these tweets have been deleted, probably in shame when their authors realized they had posted a fake quote.
  32. And then someone posted the first part of the misattributed quote on twitter.
      We only have the faint trace it left on Topsy, a twitter archiving service.
  33. 02:42PM ET
      Fortunately for us, some twitter archives exist.
      Here is the earliest recorded tweet.
      (These tweets are called the ‘quote corpus’ in my research notes.)
  34. 02:52PM ET
      Ten minutes later, a properly formatted quote reaches a so-called ‘Influential account’.
  35. 03:15PM ET
      25 minutes later, the quote is reposted by Penn Jillette, a pretty famous US magician.
      At that point, several things happen.
      (a) The cascade accelerates exponentially.
      (b) Some people, mostly journalists, begin to doubt the authenticity of the quote.
  36. May 2, 6:23PM
      Megan McArdle, then an editor at The Atlantic, publishes a blog post at the end of the afternoon where she expresses her doubts as to the authenticity of the quote.
      Newspaper icon: http://thenounproject.com/noun/newspaper/#icon-No1233
  37. It’s thanks to the work of Megan McArdle at The Atlantic that we have Dovey’s screencap.
      She obtained it via conventional journalism techniques, such as e-mailing a human being.
      That’s significant because we would never have gained access to a private FB post via an API.
      Journalists have their own investigation techniques.
      They are a tad low-tech, but this is changing thanks to the ‘data-journalism’ trend currently going on. (This comes with its own set of issues, which I won’t discuss here.)
      She followed up with a second article where she identifies Jessica Dovey.
  38. Penn Jillette retracts his previous tweet as soon as he realizes his mistake.
      This tweet does not really attract the same kind of attention as the quote.
  39. May 3, 03:20PM ET
      The interesting thing is that media coverage modified - almost bent - the cascade as it was still unfolding. Journalists were there at the right time, when it was still happening.
      But they did not know how to use an API. This Salon.com article mistakenly claims that PJ was the first one to post the quote on twitter.
  40. Why did Penn Jillette create a fake Martin Luther King Jr. quote yesterday? - Twitter - Salon.com
      May 3, 03:20PM ET
  41. PJ reacts.
  42. Salon then posted a follow-up piece, which was actually deleted at some later point - I don’t know when, I just realized it while working on these very slides.
      A few days ago I could still access it using Google Cache, which is another form of ‘ghosting’.
  43. And that was the cache last night.
      Penn Jillette did not delete his tweets, which is a rather classy move on his part.
  44. Then the big players and blogs take it from there.
      kottke, wapo, CNN, AllthingsD, New Yorker (with timestamp)
      And Jason Kottke, who actually wrote what I consider to be the best write-up.
  46. May 2, 06:26PM ET
      From there, the cascade branches again, creating a new twitter track.
      Here is the first tweet identifying the quote as fake. It was posted a few minutes after Megan McArdle’s first blog post but does not contain a link.
  47. Most tweets calling out the MLK quote as fake contain a link to some type of blog post or news article.
      My hypothesis is that people are more likely to include a link to some outside source when their tweet goes against the flow. It gives more weight to their tweets.
      Dissenting voices have a hard time being heard on twitter.
      [These tweets are called the ‘links corpus’ in my research notes.]
  48. As you can see, the peak of the links corpus (yellow) comes about 24h after the ‘quotes corpus’ peak.
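One way to measure the lag between two corpora, as on slide 48, is to bucket tweets by hour and compare the busiest hours. The sketch below is only an illustration of that idea: the timestamps are made up for the example and are not the corpus data, and the hourly bucket size is an assumption.

```python
from collections import Counter
from datetime import datetime


def peak_hour(timestamps):
    """Return the start of the one-hour bucket containing the most tweets."""
    buckets = Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)
    return max(buckets, key=buckets.get)


# Made-up timestamps (Eastern Time), NOT the actual corpora.
quote_times = [
    datetime(2011, 5, 2, 14, 42),
    datetime(2011, 5, 2, 15, 15),
    datetime(2011, 5, 2, 15, 20),
    datetime(2011, 5, 2, 15, 45),
    datetime(2011, 5, 2, 16, 5),
]
link_times = [
    datetime(2011, 5, 2, 18, 26),
    datetime(2011, 5, 3, 15, 5),
    datetime(2011, 5, 3, 15, 30),
    datetime(2011, 5, 3, 15, 50),
    datetime(2011, 5, 3, 17, 0),
]

lag = peak_hour(link_times) - peak_hour(quote_times)
print(f"Links corpus peaks {lag} after the quote corpus")
```

With real data the bucket size matters: hourly buckets can be noisy for a small corpus, so a wider window may give a more stable peak.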
  49. Tools & Evaluation
      A quick review of the tools used for this presentation.
      Pic: http://www.flickr.com/photos/anomieus/6205869109/
  50. [journalists]
      If I may go back to the wildlife metaphor for a second, journalists are a bit like a crew filming a wildlife documentary.
      Journalists were instrumental in this whole affair. Their analog methods worked and got results that would have been impossible to obtain using an API (most notably on Facebook).
      Pic: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/people-public-domain-images-pictures/people-filming-in-wild-nature.jpg
  51. Some of them used a bit of google-fu to identify the quote as fake. With a date range filter, you can see whether a quote just appeared on the web or not.
      http://www.pcworld.com/article/226912/google_daterange_filter.html
      But their lack of knowledge of the inner workings of twitter showed. Salon blamed PJ, and no one was able to identify the first twitter posts, less than 48 hours after the beginning of the cascade. The twitter Search API was still available.
      Salon, for example, does not reveal how they found their results, but obviously the methods were suboptimal (my guess is that they did not realize that twitter displays so-called ‘Top results’ as default search results).
      That’s changing as we speak. Data journalism is probably going to be featured in a number of 2012 buzzword lists.
  52. [data journalists]
      [This is probably what data journalists look like.]
      Pic: http://www.public-domain-image.com/
  53. As for my own tools, the results shown today rely mostly on a twitter archiving service called Topsy.
      When twitter announced its new terms of service this summer, almost all members of the twitter ecosystem cried out in terror. Not Topsy. Topsy is exactly the kind of service twitter wants as a partner. This is the kind of ecosystem twitter is trying to build.
  54. There are various twitter archiving services. I chose Topsy because it is the most fully featured free twitter archive that I could find.
      The crucial point was that it allowed me to use a date range filter on the search.
  55. Topsy even offers an API, otter. You can interact with it using python-otter, a python library, or directly using the REST API.
  56. Topsy archives...
      > tweets containing a link, or
      > tweets that were retweeted
      How much does Topsy archive? The first question would be ‘What does Topsy archive?’
      Tweets are recorded when either (a) they are retweeted by someone else or (b) they contain a link.
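The archiving rule on slide 56 can be written as a simple predicate. This is only a sketch of the rule as described in the notes, not Topsy’s actual implementation: the `Tweet` class, the link test, and the sample tweets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical minimal record; real tweet objects carry many more fields.
    text: str
    retweet_count: int = 0


def contains_link(tweet: Tweet) -> bool:
    """Crude link test: looks for an http(s) URL prefix in the text."""
    return "http://" in tweet.text or "https://" in tweet.text


def would_be_archived(tweet: Tweet) -> bool:
    """Rule from the notes: (a) retweeted by someone else, or (b) contains a link."""
    return tweet.retweet_count > 0 or contains_link(tweet)


# Made-up sample tweets for illustration only.
sample = [
    Tweet("just a thought", retweet_count=0),
    Tweet("a quote worth sharing", retweet_count=12),
    Tweet("read this: http://example.com/post", retweet_count=0),
    Tweet("retweeted and linked http://example.com/x", retweet_count=3),
]

archived = [t for t in sample if would_be_archived(t)]
print(f"{len(archived)} of {len(sample)} sample tweets match the rule")
```

Because the two conditions are combined with `or`, a tweet that is both retweeted and contains a link is counted once, which matters when estimating coverage.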
  57. Interesting caveat: Topsy records tweets when they are published. It means that if you delete a tweet after Topsy has archived it, the tweet will be removed from twitter, but not from Topsy.
      It’s something to keep in mind: Topsy offers something like a ghost image, an afterimage.
  59. Ghosting effect on the ‘quotes’ corpus
      ‘quote’ corpus: 2657 different authors in Topsy data
      394 were unavailable on twitter:
      - Hidden: 94
      - Suspended: 4
      - Closed: 296
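The share of ‘ghost’ authors on slide 59 follows directly from the counts. A short check, using only the figures on the slide (the variable names are mine):

```python
authors_in_topsy = 2657
unavailable = {"hidden": 94, "suspended": 4, "closed": 296}

total_unavailable = sum(unavailable.values())
ghost_share = total_unavailable / authors_in_topsy

print(f"Unavailable authors: {total_unavailable}")
print(f"Share of all authors: {ghost_share:.1%}")
for reason, count in unavailable.items():
    print(f"  {reason}: {count} ({count / total_unavailable:.1%} of unavailable)")
```

The three categories do add up to the 394 reported, so roughly one author in seven was no longer reachable on twitter by the time of the study.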
  60. Test sample: 174368 tweets, including:
      - 3140 RT
      - 18400 links
      Topsy’s coverage ≈ 12.1%
      Topsy archives tweets that were retweeted or tweets containing a URL. Topsy archives tweets *that are links*. This is the value. The textual content is impossible, or at least not cost-effective, to exploit right now.
      To evaluate our results is to evaluate Topsy.
      It has become pretty hard to get a twitter dataset: twitter recently changed its terms of service and several previously available datasets have become unavailable.
      I got a sample from the Streaming API. It contains about 175000 tweets, which amounts to a few minutes of firehose output.
      On our sample, Topsy’s coverage would be about 12%, excluding duplicates.
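The coverage estimate on slide 60 can be bounded from the counts alone. The sketch below assumes the 3140 and 18400 figures count tweets matching each archiving condition; the overlap it derives is an inference from the reported ≈12.1%, not a number given in the notes.

```python
sample_size = 174368
retweets = 3140
with_links = 18400
reported_coverage = 0.121  # "≈ 12.1%" on the slide

# Upper bound: no tweet matches both conditions.
upper_bound = (retweets + with_links) / sample_size

# If the reported figure is the de-duplicated union, this many tweets
# would have to match both conditions (an inference, not a reported value).
implied_overlap = (retweets + with_links) - reported_coverage * sample_size

print(f"Upper bound on coverage: {upper_bound:.2%}")
print(f"Implied overlap: ~{implied_overlap:.0f} tweets")
```

The gap between the upper bound and the reported value is small, which is consistent with the note that the 12% figure excludes duplicates.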
  61. In closing...
      Pic: http://www.flickr.com/photos/biodivlibrary/6217534124/
  62. • Resource decay is not uniform
      • Linked content has more value
      • Media coverage plays an important part in the phenomenon and in what survives
      At least in the case of twitter, decay is not evenly distributed.
      Content that survives is content that was interesting to someone, content that is linked. The value is the link, not the content.
      In our case, it’s worth considering that media coverage both unearthed some content (the FB screenshot) and buried other pieces of information (deleted tweets).
      Journalists and data investigation techniques are complementary. Data journalism will probably fuse the two in the long run.
  63. Thank you
      m.lafrechoux@gmail.com
      http://nologos.net/
      @lagayascienza
      Thank you for your attention.
      Pic: http://www.flickr.com/photos/mmechtley/5379944214/
