
Digital Paleontology - Digging for Ancient Tweets

  1. 1. Digital Paleontology: Digging for Ancient Tweets. Martin Lafréchoux. JITSO 2012 – EPFL, December 4th, 2012. The full text of the research notes is available on the JITSO website and on my research blog. -- Pic : www.flickr.com/photos/mag3737/307016400/
  2. 2. Hi. My name is Martin Lafréchoux. I am a PhD student at Paris Ouest Nanterre. My dissertation deals with the web page as a document. I'll try not to repeat the content of the research notes published on the website. I'd rather address some points that were left out due to space constraints or because they were not ready for publication. Mostly I’ll talk about the various twitter APIs and their implications for the researcher. But first I'd like to try and explain the title of the presentation. Digital Paleontology: why?
  3. 3. Twitter is live. You watch twitter as you would TV, to see what's happening right now. When a piece of information spreads on twitter, it creates a cascade. Cascades are live data, best observed in the wild; that's the whole point of just-in-time sociology. The cascade goes on for a short while, then it disappears. Twitter offers several ways to observe these cascades. We will go through them briefly. It’s a bit hard to represent something as vivid in a slide, so I hope you can pardon me for using shaky visual metaphors.
  4. 4. Basic use: visiting twitter.com. The simplest way to observe information cascades is in the wild, using the twitter web client. You’re in the middle of the action, but that’s not always the best place to see the big picture. If that’s not precise enough, twitter offers several APIs with distinct characteristics. -- Pic : http://www.public-domain-image.com/
  5. 5. Real Time: the Streaming API The most complete API is called the streaming API. Complete access to all posted tweets is called the firehose. -- Pic : http://www.flickr.com/photos/usnavy/5887790560/
  6. 6. Access is pretty limited, both expensive and exclusive. Gaining access to the firehose is one thing. Then you have to have the stomach to drink from it.
  7. 7. October 2012 average: 500 000 000 / day. A complete recording of a month of twitter in January 2011 is about 1 billion tweets (Myers & Leskovec, 2012); that’s about 400 unique tweets per second on average, not counting RTs. That’s for an average day, almost two years ago. In October, twitter's CEO reported 500 million tweets per day. http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/ That’s about 5800 per second. -- Pic: http://www.flickr.com/photos/chrish_99/7431798496/
  8. 8. There is not much you can do with such an astonishing stream of data besides displaying it, as twitter does, or storing it to analyze it later. Even that requires huge resources. -- Pic: http://www.flickr.com/photos/thomashawk/5683179189/
  9. 9. Digital Taxidermy? Once recorded, the live data is frozen. The cascades are turned into something like a stuffed animal. Digital taxidermy, if you will. -- Damien Hirst, The Physical Impossibility of Death in the Mind of Someone Living. Photo: http://www.flickr.com/photos/chaostrophy/2594401926/
  10. 10. Just-in-time: The Search API. The next best option is to use the Search API to analyze tweets in real time, as they are published. After real-time recording, you get just-in-time. For six to nine days, you can use the twitter Search API. The Search API would be like an underwater tunnel, giving you easy access to the closest data. -- Pic: http://www.flickr.com/photos/lorensztajer/4201751064/
  11. 11. A bit too late : the REST API But what happens after that, after real-time? When you are just-too-late? After a week or so, you only have the REST API available. It means you cannot search, and you only get 150 API calls an hour. You can view the details of a tweet or a user if you already know where to look. As the twitter dev docs put it, you need to get «creative» to get what you want with such limitations. -- Pic: http://www.flickr.com/photos/stinkenroboter/6604532503/
  12. 12. Much too late... Digital Paleontology. You don’t have access to the real, live cascade anymore, but you can still see its shadow, its ghost. And if you spend enough time putting the scattered pieces back together, it can give you a good idea of what the cascade was. So, to me, what you do when you are too late for just-in-time amounts to digital paleontology. There are bits and pieces of data scattered all over the web, if you only know where to dig. For my PhD, I am trying to work a bit of digital paleontology to reconstruct an information cascade that happened more than a year and a half ago, on May 2, 2011. As you may have guessed, I originally tried to map the cascade about 15 days after the fact using the twitter API, and came back sorely disappointed. I tried again six months later, and this presentation is the result of my work. -- Pic: http://www.flickr.com/photos/14508691@N08/4531324072/
  13. 13. May 01, 2011 – 03:58PM ET On twitter, it began like this. A Pakistani can’t sleep because of a helicopter. (Timestamps are crucial. I have converted everything to Eastern Time for clarity.)
  14. 14. May 01, 2011 – 10:24PM ET Six and a half hours later, wrestler and entertainer Dwayne ‘The Rock’ Johnson posts a rather cryptic tweet.
  15. 15. May 01, 2011 – 10:24PM ET At the same exact minute, Rumsfeld's former chief of staff posts a more explicit one.
  16. 16. People begin gathering in front of the White House, waiting for the press conference. -- Pic: http://www.flickr.com/photos/theqspeaks/5679548043/
  17. 17. May 01, 2011 – 11:35PM ET An hour later, Barack Obama appeared on TV to announce the news. Responses were mixed. -- Pic: http://www.flickr.com/photos/us_embassy_newzealand/5682145416/
  18. 18. Some Americans were very vocal in their enthusiasm. Pic: http://www.flickr.com/photos/zokuga/5678699597/
  19. 19. “I have never wished a man dead, but I have read some obituaries with great pleasure.” Mark Twain. Somewhat less enthusiastic Americans expressed their feelings by tweeting this quote by Mark Twain... which is actually by civil rights lawyer Clarence Darrow. -- Pic : http://en.wikipedia.org/wiki/File:Mark_Twain,_Brady-Handy_photo_portrait,_Feb_7,_1871,_cropped.jpg
  20. 20. “I have never wished a man dead, but I have read some obituaries with great pleasure.” Clarence Darrow. Somewhat less enthusiastic Americans expressed their feelings by tweeting this quote by Mark Twain... which is actually by civil rights lawyer Clarence Darrow. -- Pic : http://en.wikipedia.org/wiki/File:Mark_Twain,_Brady-Handy_photo_portrait,_Feb_7,_1871,_cropped.jpg
  21. 21. It seems that Mark Twain is to the US as Winston Churchill is to the UK or Jules Renard to France: funny quotes are attributed to him by default. The mistake probably comes down to this page. You can imagine how this would look as a Google snippet. http://www.estatevaults.com/lm/archives/2006/08/16/twain_and_darro.html
  22. 22. “I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.” Martin Luther King, Jr. Some Americans were appalled that bin Laden was killed rather than taken into custody. Many chose to tweet this quote of Martin Luther King to express their feelings. As you may have guessed, Martin Luther King never said or wrote this sentence. And this is what I will be talking about today, after a rather lengthy introduction. Now, how did this happen? How did this misattributed quote go viral? Now that’s a job for a digital paleontologist. -- Pic: http://en.wikipedia.org/wiki/File:Martin_Luther_King_Jr_NYWTS.jpg
  23. 23. “I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.” Some Americans were appalled that bin Laden was killed rather than taken into custody. Many chose to tweet this quote of Martin Luther King to express their feelings. As you may have guessed, Martin Luther King never said or wrote this sentence. And this is what I will be talking about today, after a rather lengthy introduction. Now, how did this happen? How did this misattributed quote go viral? Now that’s a job for a digital paleontologist. -- Pic: http://en.wikipedia.org/wiki/File:Martin_Luther_King_Jr_NYWTS.jpg
  24. 24. “I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.” ? Some Americans were appalled that bin Laden was killed rather than taken into custody. Many chose to tweet this quote of Martin Luther King to express their feelings. As you may have guessed, Martin Luther King never said or wrote this sentence. And this is what I will be talking about today, after a rather lengthy introduction. Now, how did this happen? How did this misattributed quote go viral? Now that’s a job for a digital paleontologist. -- Pic: http://en.wikipedia.org/wiki/File:Martin_Luther_King_Jr_NYWTS.jpg
  25. 25. The cascade We don’t often get to see the starting point of an information epidemic, but in this case it is known. -- Pic: http://www.flickr.com/photos/vilseskogen/3279138165/
  26. 26. It all started from this facebook post. Jessica Dovey, who teaches English in Kobe, Japan, posts the following message on her Facebook wall.
  27. 27. May 02, 2011 – 12:15PM ET It’s interesting because : - we don’t always get access to the starting point of a cascade - we don’t often get to see Facebook at all. I was prepared to say that we have no idea what happened exactly, but I’ll make a guess.
  28. 28. All that we know is that at some point, someone stripped the quote of anything that was actually written by MLK and presented him as the author. Here is what (to the best of my knowledge) happened.
  29. 29. All that we know is that at some point, someone stripped the quote of anything that was actually written by MLK and presented him as the author. Here is what (to the best of my knowledge) happened.
  30. 30. 02:22PM ET A user posted the whole quote to twitlonger, mistakenly attributing the whole of it to MLK. Quite a few tweets point to this page (it says 80, Topsy recorded a dozen or so)
  31. 31. Many of these tweets have been deleted, probably in shame when their authors realized they had posted a fake quote.
  32. 32. And then someone posted the first part of the misattributed quote on twitter. We only have the faint trace it left on Topsy, a twitter archiving service.
  33. 33. 02:42PM ET Fortunately for us, some twitter archives exist. Here is the earliest recorded tweet. (These tweets are called the ‘quote corpus’ in my research notes.)
  34. 34. 02:52PM ET Ten minutes later, a properly formatted quote reaches a so-called ‘Influential account’
  35. 35. 03:15PM ET 25 minutes later, the quote is reposted by Penn Jillette, a pretty famous US magician. At that point, several things happen. (a) The cascade accelerates exponentially (b) Some people, mostly journalists, begin to doubt the authenticity of the quote.
  36. 36. May 2, 6:23PM Megan McArdle, then an editor at The Atlantic, publishes a blog post at the end of the afternoon where she expresses her doubts as to the authenticity of the quote. -- Newspaper icon: http://thenounproject.com/noun/newspaper/#icon-No1233
  37. 37. It’s thanks to the work of Megan McArdle at the Atlantic that we have Dovey’s screencap. She obtained it via conventional journalism techniques, such as e-mailing a human being. That’s significant because we would never have gained access to a private FB post via an API. Journalists have their own investigation techniques. They are a tad low-tech but this is changing, thanks to the ‘data-journalism’ trend currently going on. (This comes with its own set of issues which I won’t discuss here). She followed up with a second article where she identifies Jessica Dovey.
  38. 38. Penn Jillette retracts his previous tweet as soon as he realizes his mistake. This tweet does not really attract the same kind of attention as the quote.
  39. 39. May 3, 03:20PM ET The interesting thing is that media coverage modified - almost bent - the cascade as it was still unfolding. Journalists were there at the right time, when it was still happening. But they did not know how to use an API. This Salon.com article mistakenly believes that PJ was the first one to post the tweet on twitter.
  40. 40. Why did Penn Jillette create a fake Martin Luther King Jr. quote yesterday? - Twitter - Salon.com May 3, 03:20PM ET The interesting thing is that media coverage modified - almost bent - the cascade as it was still unfolding. Journalists were there at the right time, when it was still happening. But they did not know how to use an API. This Salon.com article mistakenly believes that PJ was the first one to post the tweet on twitter.
  41. 41. PJ reacts.
  42. 42. Salon then posted a follow up piece, which was actually deleted at some later point - I don’t know when, I just realized it while working on these very slides. A few days ago I could still access it using Google Cache, which is another form of ‘ghosting’.
  43. 43. And that was the cache last night. Penn Jillette did not delete his tweets, which is a rather classy move on his part.
  44. 44. Then the big players and blogs take it from there: kottke, wapo, CNN, AllthingsD, New Yorker (with timestamp). Jason Kottke actually wrote what I consider to be the best write-up.
  45. 45. Then the big players and blogs take it from there: kottke, wapo, CNN, AllthingsD, New Yorker (with timestamp). Jason Kottke actually wrote what I consider to be the best write-up.
  46. 46. May 2, 06:26PM ET From there, the cascade branches again, creating a new twitter track. Here is the first tweet identifying the quote as fake. It was posted a few minutes after Megan McArdle’s first blog post but does not contain a link.
  47. 47. Most tweets calling out the MLK quote as fake contain a link to some type of blog post or news article. My hypothesis is that people are more likely to include a link to some outside source when their tweet goes 'against the flow'. It gives more weight to their tweets. Dissenting voices have a hard time being heard on twitter. [These tweets are called the ‘links corpus’ in my research notes.]
  48. 48. As you can see, the peak of the links corpus (yellow) comes about 24h after the ‘quotes corpus’ peak.
  49. 49. Tools & Evaluation A quick review of the tools used for this presentation. -- Pic: http://www.flickr.com/photos/anomieus/6205869109/
  50. 50. [journalists] If I may go back to the wildlife metaphor for a second, journalists are a bit like a crew filming a wildlife documentary. Journalists were instrumental in this whole affair. Their analog methods worked and got results that would have been impossible to obtain using an API (most notably on Facebook). -- Pic: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/people-public-domain-images-pictures/people-filming-in-wild-nature.jpg
  51. 51. Some of them used a bit of google-fu to identify the quote as fake. With a date range filter, you can see whether a quote just appeared on the web or not. http://www.pcworld.com/article/226912/google_daterange_filter.html But their lack of knowledge of the inner workings of twitter showed. Salon blamed PJ, and no one was able to identify the first twitter posts, less than 48 hours after the beginning of the cascade. The twitter Search API was still available. Salon, for example, does not reveal how they found their results, but obviously the methods were suboptimal (my guess is that they did not realize that twitter displays so-called ‘Top results’ as default search results). That's changing as we speak. Data journalism is probably going to be featured in a number of 2012 buzzword lists.
  52. 52. [data journalists] [This is probably what data journalists look like.] -- Pic: http://www.public-domain-image.com/
  53. 53. As for my own tools, the results shown today rely mostly on a twitter archiving service called Topsy. When twitter announced its new terms of service this summer, almost all members of the twitter ecosystem cried in terror. Not Topsy. Topsy is exactly the kind of service twitter wants as a partner. This is the kind of ecosystem twitter is trying to build.
  54. 54. There are various twitter archiving services. I chose Topsy because it is the most fully featured free twitter archive that I could find. The crucial point was that it allows a date range filter on searches.
  55. 55. Topsy even offers an API, otter. You can interact with it using python-otter, a python library, or directly using the REST API.
  56. 56. Topsy archives... > tweets containing a link, or > tweets that were retweeted How much does Topsy archive ? The first question would be ‘What does Topsy archive?’ Tweets are recorded when either (a) they are retweeted by someone else or (b) they contain a link
  57. 57. Interesting caveat: Topsy records tweets when they are published. It means that if you delete a tweet after Topsy has archived it, the tweet will be removed from twitter, but not from Topsy. It’s something to keep in mind: Topsy offers something like a ghost image, a remanent image.
  58. 58. Interesting caveat: Topsy records tweets when they are published. It means that if you delete a tweet after Topsy has archived it, the tweet will be removed from twitter, but not from Topsy. It’s something to keep in mind: Topsy offers something like a ghost image, a remanent image.
  59. 59. ‘quote’ corpus: 2657 different authors in Topsy data; 394 were unavailable on twitter (hidden: 94, suspended: 4, closed: 296). Ghosting effect on the ‘quotes’ corpus.
  60. 60. Test sample: 174368 tweets, including 3140 RT and 18400 links. Topsy’s coverage ≈ 12.1%. Topsy archives tweets that were retweeted or tweets containing a URL. Topsy archives tweets *that are links*. This is the value. The textual content is impossible / not cost-effective to exploit right now. To evaluate our results is to evaluate Topsy. It has become pretty hard to get a twitter dataset: twitter recently changed its terms of service and several previously available datasets have become unavailable. I got a sample from the streaming API. It contains about 175000 tweets, which amounts to a few minutes of firehose output. On our sample, Topsy’s coverage would be about 12%, excluding duplicates.
  61. 61. In closing... -- Pic: http://www.flickr.com/photos/biodivlibrary/6217534124/
  62. 62. •Resource decay is not uniform •Linked content has more value •Media coverage plays an important part in the phenomenon and in what survives At least in the case of twitter, decay is not evenly distributed. Content that survives is content that was interesting to someone, content that is linked. The value is the link, not the content. In our case, it’s worth considering that media coverage both unearthed some content (the FB screenshot) and buried other pieces of information (deleted tweets). Journalists and data investigation techniques are complementary. Data journalism will probably fuse the two in the long run.
  63. 63. Thank you m.lafrechoux@gmail.com http://nologos.net/ @lagayascienza Thank you for your attention. Pic: http://www.flickr.com/photos/mmechtley/5379944214/
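
A quick check of the orders of magnitude quoted on slides 7 and 11. This is only a sketch of the arithmetic, not part of the original study: it assumes a 31-day month for the January 2011 recording, and the "one REST call per author" budget for the 2657 authors of the quote corpus is a hypothetical illustration of what 150 calls an hour means in practice.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(count, days):
    """Average number of events per second over a span of whole days."""
    return count / (days * SECONDS_PER_DAY)


def hours_needed(calls, calls_per_hour):
    """Whole hours needed to make `calls` requests under an hourly quota."""
    return -(-calls // calls_per_hour)  # ceiling division on integers


# Slide 7: about 1 billion tweets in January 2011 (31 days assumed).
jan_2011 = per_second(1_000_000_000, days=31)
# Slide 7: 500 million tweets per day reported in October 2012.
oct_2012 = per_second(500_000_000, days=1)
# Slide 11: 150 REST calls an hour; one call per author is an assumption.
lookup_hours = hours_needed(2657, 150)

print(f"January 2011: {jan_2011:.0f} tweets/s")   # 373, i.e. "about 400"
print(f"October 2012: {oct_2012:.0f} tweets/s")   # 5787, i.e. "about 5800"
print(f"2657 lookups at 150 calls/hour: {lookup_hours} hours")  # 18
```

Even under that optimistic assumption, checking each author of the quote corpus once takes most of a day, which gives an idea of what getting «creative» with the REST API involves.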

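The archiving rule from slide 56 and the ratios on slides 59 and 60 can be sketched as follows. The tweet structure (a dict with `text` and `retweet_count`) and the URL pattern are assumptions made for illustration; they are not Topsy's actual data model or the method used in the research notes. The four sample tweets are made up.

```python
import re

# Rough link detector; an assumption, not the rule Topsy really applies.
URL_PATTERN = re.compile(r"https?://\S+")


def is_archivable(tweet):
    """Slide 56 rule: retweeted at least once, or contains a link.

    `tweet` is a hypothetical dict with a "text" string and a
    "retweet_count" integer.
    """
    return tweet["retweet_count"] > 0 or bool(URL_PATTERN.search(tweet["text"]))


def coverage(tweets):
    """Share of tweets matching the rule; each tweet is counted once."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    return sum(is_archivable(t) for t in tweets) / len(tweets)


# Made-up sample: one plain tweet, one link, one retweeted, one both.
sample = [
    {"text": "plain tweet", "retweet_count": 0},
    {"text": "see http://example.com/post", "retweet_count": 0},
    {"text": "retweeted, no link", "retweet_count": 3},
    {"text": "both http://example.com/x", "retweet_count": 2},
]
print(coverage(sample))  # 0.75: the tweet matching both criteria counts once

# Counts reported on slides 59 and 60.
upper_bound = (3140 + 18400) / 174368  # if RTs and links never overlapped
ghosted = 394 / 2657                   # authors no longer available on twitter
print(f"{upper_bound:.1%} {ghosted:.1%}")  # 12.4% 14.8%
```

Adding the two counts from slide 60 gives 12.4%, slightly above the ≈12.1% on the slide, which is consistent with the slide's note that duplicates were excluded. By the same arithmetic, the ghosting effect on slide 59 concerns roughly 15% of the authors in the quote corpus.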