Digital Paleontology
                                   Digging for Ancient Tweets
                                       Martin Lafréchoux
                             JITSO 2012 – EPFL, December 4th, 2012




Full text of the Research notes is available on the JITSO website and on my research blog

--
Pic :   www.flickr.com/photos/mag3737/307016400/
Hi. My name is Martin Lafréchoux.

I am a PhD student at Paris Ouest Nanterre. My dissertation deals with the web page as a
document.

I'll try not to repeat the content of the research notes published on the website. I'd rather
address some points that were left out due to space constraints or because they were not
ready for publication.

Mostly I’ll talk about the various twitter APIs and their implications for the researcher.

But first I'd like to try and explain the title of the presentation. Digital Paleontology: why?
- Twitter is live. You watch twitter as you would TV, to see what's happening right now.

When a piece of information spreads on twitter, it creates a cascade

Cascades are live data, best observed in the wild ; that's the whole point of just-in-time
sociology.

The cascade goes on for a short while, then it disappears.

Twitter offers several ways to observe these cascades.

We will go through them briefly.

It’s a bit hard to represent something so vivid on a slide, so I hope you can pardon me for
using shaky visual metaphors.
Basic use:
                        visiting twitter.com


The simplest way to observe information cascades is in the wild, using the twitter web
client.

You’re in the middle of the action, but that’s not always the best place to see the big picture.

If that’s not precise enough, twitter offers several APIs with distinct characteristics.
--
Pic : http://www.public-domain-image.com/
Real Time:
                            the Streaming API
The most complete API is called the streaming API. Complete access to all posted tweets is
called the firehose.
--
Pic : http://www.flickr.com/photos/usnavy/5887790560/
Access is pretty limited, both expensive and exclusive.

Gaining access to the firehose is one thing. Then you have to have the stomach to drink from
it.
October 2012 average :
            500 000 000 / day




A complete recording of one month of twitter, January 2011, is about 1 billion tweets (Myers
& Leskovec, 2012). That’s about 400 unique tweets per second on average, not counting RTs.

That was the average almost two years ago.

In October, twitter’s CEO reported 500 million tweets per day:
http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/
That’s about 5800 per second.
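As a quick sanity check on these two rates, here is a minimal Python sketch. The function name is mine, and the only inputs are the two figures quoted above; the January 2011 result comes out a little under the rounded "about 400".

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(total_tweets, days):
    """Average number of tweets per second over a period of `days` days."""
    return total_tweets / (days * SECONDS_PER_DAY)

# About 1 billion tweets over the 31 days of January 2011.
jan_2011 = per_second(1_000_000_000, 31)

# 500 million tweets per day, as reported in October 2012.
oct_2012 = per_second(500_000_000, 1)

print(round(jan_2011))  # 373
print(round(oct_2012))  # 5787
```

In other words, the average rate grew roughly fifteenfold between the two measurements.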

--
Pic: http://www.flickr.com/photos/chrish_99/7431798496/
There is not much you can do with such an astonishing stream of data besides displaying it,
as twitter does, or storing it to analyze it later. Even that requires huge resources.

--
Pic: http://www.flickr.com/photos/thomashawk/5683179189/
Digital Taxidermy ?




Once recorded, the live data is frozen. The cascades are turned into something like a stuffed
animal. Digital taxidermy, if you will.

--
Damien Hirst, The Physical Impossibility of Death in the Mind of Someone Living
Photo: http://www.flickr.com/photos/chaostrophy/2594401926/
Just-in-time:
                      The Search API
The next best option is to use the Search API to analyze tweets in real-time, as they are
published.
After real-time recording, you get just-in-time. For six to nine days, you can use the twitter
search API.
The search API would be like an underwater tunnel, giving you easy access to the closest
data.

--
Pic: http://www.flickr.com/photos/lorensztajer/4201751064/
A bit too late :
                                              the REST API
But what happens after that, after real-time? When you are just-too-late?
After a week or so, you only have the REST API available.
It means you cannot search, and you only get 150 API calls an hour.
You can view the details of a tweet or a user if you already know where to look.

As the twitter dev docs put it, you need to get «creative» to get what you want with such
limitations.
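To give an idea of what that limit means in practice, here is a small Python sketch. It assumes that each tweet or user lookup costs exactly one API call, which is a simplification of mine; the function names are hypothetical, and 2657 is the number of authors in the quote corpus discussed later in this talk.

```python
# Assumption: each tweet or user lookup costs exactly one API call.
CALLS_PER_HOUR = 150  # REST API limit quoted in the talk

def min_delay_seconds(calls_per_hour=CALLS_PER_HOUR):
    """Smallest pause between calls that stays under the hourly limit."""
    return 3600 / calls_per_hour

def hours_needed(lookups, calls_per_hour=CALLS_PER_HOUR):
    """Wall-clock hours needed to perform `lookups` calls at the limit."""
    return lookups / calls_per_hour

print(min_delay_seconds())           # 24.0 seconds between calls
print(round(hours_needed(2657), 1))  # 17.7 hours for 2657 lookups
```

Checking a few thousand accounts one by one is therefore an overnight job at best.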

--
Pic: http://www.flickr.com/photos/stinkenroboter/6604532503/
Much too late...
               Digital Paleontology



You don’t have access to the real, live cascade anymore, but you can still see its shadow, its
ghost.

And if you spend enough time putting back together the scattered pieces, it can give you a
good idea of what the cascade was.

So what happens when you are too late for just-in-time amounts, to me, to digital
paleontology.

There are bits and pieces of data, scattered all over the web if you only know where to dig.

For my PhD, I am trying to work a bit of digital paleontology to reconstruct an information
cascade that happened more than a year and a half ago, on May 2, 2011.

As you may have guessed, I originally tried to map the cascade about 15 days after the fact
using the twitter API, and came back sorely disappointed. I tried again six months later, and
this presentation is the result of my work.

--
Pic: http://www.flickr.com/photos/14508691@N08/4531324072/
May 01, 2011 – 03:58PM ET




On twitter, it began like this. A Pakistani can’t sleep because of a helicopter.

(Timestamps are crucial. I have converted everything to Eastern Time for clarity.)
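For the record, the conversion itself can be sketched in a few lines of Python. This is only an illustration: it assumes timestamps arrive in UTC and uses a fixed UTC-4 offset, which is valid here because early May 2011 falls within US daylight saving time. The function name and the UTC value (derived from the slide's ET time) are mine.

```python
from datetime import datetime, timedelta, timezone

# Assumption: early May 2011 is in US daylight saving time, so ET = UTC-4.
EDT = timezone(timedelta(hours=-4), "EDT")

def to_eastern(utc_dt):
    """Convert a timezone-aware UTC datetime to Eastern Daylight Time."""
    return utc_dt.astimezone(EDT)

# 19:58 UTC is the UTC equivalent of the 03:58PM ET slide timestamp.
first_tweet_utc = datetime(2011, 5, 1, 19, 58, tzinfo=timezone.utc)

print(to_eastern(first_tweet_utc).isoformat())  # 2011-05-01T15:58:00-04:00
```

A fixed offset is good enough for a two-day window; a corpus spanning a clock change would need a real time zone database.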
May 01, 2011 – 10:24PM ET



Six and a half hours later, wrestler and entertainer Dwayne ‘The Rock’ Johnson posts a rather
cryptic tweet.
May 01, 2011 – 10:24PM ET




At the exact same minute, a former Rumsfeld chief of staff posts a more explicit one.
People begin gathering in front of the White House, waiting for the press conference.

--
Pic: http://www.flickr.com/photos/theqspeaks/5679548043/
May 01, 2011 – 11:35PM ET


An hour later, Barack Obama appeared on TV to announce the news.

Responses were mixed.

--
Pic: http://www.flickr.com/photos/us_embassy_newzealand/5682145416/
Some Americans were very vocal in their enthusiasm.

Pic: http://www.flickr.com/photos/zokuga/5678699597/
“I have never wished a man dead, but I have read
       some obituaries with great pleasure.”
                                             Mark Twain

Somewhat less enthusiastic Americans expressed their feelings by tweeting this quote by
Mark Twain.

... which is actually by civil rights lawyer Clarence Darrow.

--
Pic : http://en.wikipedia.org/wiki/File:Mark_Twain,_Brady-Handy_photo_portrait,_Feb_7,_1871,_cropped.jpg
“I have never wished a man dead, but I have read
       some obituaries with great pleasure.”
                                        Clarence Darrow

It seems that Mark Twain is to the US as Winston Churchill is to the UK or Jules Renard to
France: funny quotes are attributed to him by default.

The mistake probably comes down to this page. You can imagine how this would look as a
Google snippet.

http://www.estatevaults.com/lm/archives/2006/08/16/twain_and_darro.html
“I mourn the loss of thousands of precious lives, but I
    will not rejoice in the death of one, not even an enemy.”
                                                 Martin Luther King, Jr.

Some Americans were appalled that bin Laden was killed rather than taken into custody.
Many chose to tweet this quote of Martin Luther King to express their feelings.

As you may have guessed, Martin Luther King never said or wrote this sentence.

And this is what I will be talking about today, after a rather lengthy introduction.

Now, how did this happen? How did this misattributed quote go viral?

Now that’s a job for a digital paleontologist.

--
Pic: http://en.wikipedia.org/wiki/File:Martin_Luther_King_Jr_NYWTS.jpg
The cascade



We don’t often get to see the starting point of an information epidemic, but in this case it is
known.

--
Pic: http://www.flickr.com/photos/vilseskogen/3279138165/
It all started with this Facebook post.

Jessica Dovey, who teaches English in Kobe, Japan, posts the following message on her
Facebook wall.
May 02, 2011 – 12:15PM ET




It’s interesting because :
- we don’t always get access to the starting point of a cascade
- we don’t often get to see Facebook at all.

I was prepared to say that we have no idea what happened exactly, but I’ll make a guess.
All that we know is that at some point, someone stripped the quote of anything that was
actually written by MLK and presented him as the author.

Here is what (to the best of my knowledge) happened.
02:22PM ET

A user posted the whole quote to twitlonger, mistakenly attributing the whole of it to MLK.

Quite a few tweets point to this page (the page says 80; Topsy recorded a dozen or so).
Many of these tweets have been deleted, probably out of embarrassment when their authors
realized they had posted a fake quote.
And then someone posted the first part of the misattributed quote on twitter.

We only have the faint trace it left on Topsy, a twitter archiving service.
02:42PM ET




Fortunately for us, some twitter archives exist.

Here is the earliest recorded tweet.

(These tweets are called the ‘quote corpus’ in my research notes.)
02:52PM ET




Ten minutes later, a properly formatted quote reaches a so-called ‘Influential account’
03:15PM ET




25 minutes later, the quote is reposted by Penn Jillette, a pretty famous US magician.

At that point, several things happen.

(a) The cascade accelerates exponentially
(b) Some people, mostly journalists, begin to doubt the authenticity of the quote.
May 2, 6:23PM




Megan McArdle, then an editor at The Atlantic, publishes a blog post at the end of the
afternoon where she expresses her doubts as to the authenticity of the quote.

--
Newspaper icon: http://thenounproject.com/noun/newspaper/#icon-No1233
It’s thanks to the work of Megan McArdle at the Atlantic that we have Dovey’s screencap.

She obtained it via conventional journalism techniques, such as e-mailing a human being.

That’s significant because we would never have gained access to a private FB post via an API.

Journalists have their own investigation techniques.

They are a tad low-tech but this is changing, thanks to the ‘data-journalism’ trend currently
going on. (This comes with its own set of issues which I won’t discuss here).

She followed up with a second article where she identifies Jessica Dovey.
Penn Jillette retracts his previous tweet as soon as he realizes his mistake.
This tweet does not really attract the same kind of attention as the quote.
May 3, 03:20PM ET




The interesting thing is that media coverage modified - almost bent - the cascade as it was
still unfolding. Journalists were there at the right time, when it was still happening.

But they did not know how to use an API. This Salon.com article mistakenly claims that Penn
Jillette (PJ) was the first one to post the quote on twitter.
Why did Penn Jillette create a fake Martin Luther King Jr. quote yesterday? - Twitter - Salon.com




PJ reacts.
Salon then posted a follow-up piece, which was deleted at some later point. I don’t know
when; I only noticed while working on these very slides.

A few days ago I could still access it using Google Cache, which is another form of ‘ghosting’.
And that was the cache last night.

Penn Jillette did not delete his tweets, which is a rather classy move on his part.
Then the big players and blogs take it from there.

kottke, wapo, CNN, AllthingsD, New Yorker (with timestamp)

And Jason Kottke, who actually wrote what I consider to be the best write-up.
May 2, 06:26PM ET

From there, the cascade branches again, creating a new twitter track.

Here is the first tweet identifying the quote as fake. It was posted a few minutes after Megan
McArdle’s first blog post but does not contain a link.
Most tweets calling out the MLK quote as fake contain a link to some type of blog post or
news article.

My hypothesis is that people are more likely to include a link to some outside source when
their tweet goes 'against the flow'. It gives more weight to their tweets.

Dissenting voices have a hard time being heard on twitter.

[These tweets are called the ‘links corpus’ in my research notes.]
As you can see, the peak of the links corpus (yellow) comes about 24h after the ‘quotes
corpus’ peak.
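The lag between the two peaks can be measured by bucketing tweets per hour and comparing the busiest bucket of each corpus. The Python sketch below shows the idea only: the timestamps are made-up toy values, not data from either corpus, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def peak_hour(timestamps):
    """Return the start of the one-hour bucket containing the most tweets."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return max(buckets, key=buckets.get)

# Toy timestamps for illustration only (NOT taken from the real corpora).
quotes = [
    datetime(2011, 5, 2, 15, 5),
    datetime(2011, 5, 2, 15, 40),
    datetime(2011, 5, 2, 16, 10),
]
links = [
    datetime(2011, 5, 3, 15, 10),
    datetime(2011, 5, 3, 15, 20),
    datetime(2011, 5, 3, 18, 0),
]

lag = peak_hour(links) - peak_hour(quotes)
print(lag)  # 1 day, 0:00:00
```

Hourly buckets are an arbitrary choice; a finer or coarser grain would shift the measured lag slightly.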
Tools & Evaluation
A quick review of the tools used for this presentation.

--
Pic: http://www.flickr.com/photos/anomieus/6205869109/
[journalists]




If I may go back to the wildlife metaphor for a second, journalists are a bit like a crew filming
a wildlife documentary.

Journalists were instrumental in this whole affair. Their analog methods worked, and got
results that would have been impossible to obtain using an API (most notably on Facebook).

--
Pic: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/people-public-domain-images-pictures/people-filming-in-wild-nature.jpg
Some of them used a bit of google-fu to identify the quote as fake. With a date range filter,
you can see whether a quote just appeared on the web or not.
http://www.pcworld.com/article/226912/google_daterange_filter.html

But their lack of knowledge of the inner workings of twitter showed. Salon blamed PJ, and no
one was able to identify the first twitter posts, less than 48 hours after the beginning of
the cascade. The twitter Search API was still available.

Salon, for example, does not reveal how they found their results, but obviously the methods
were suboptimal (my guess is that they did not realize that twitter displays so-called ‘Top
results’ as default search results).

That's changing as we speak. Data journalism is probably going to be featured in a number of
2012 buzzword lists.
[data journalists]




[This is probably what data journalists look like.]

--
Pic: http://www.public-domain-image.com/
As for my own tools, the results shown today rely mostly on a twitter archiving service called
Topsy.

When twitter announced its new terms of service this summer, almost all members of the
twitter ecosystem cried in terror. Not Topsy. Topsy is exactly the kind of service twitter wants
as a partner. This is the kind of ecosystem twitter is trying to build.
There are various twitter archiving services. I chose Topsy because it is the most fully
featured free twitter archive that I could find.

The crucial point was that it allows a date range filter on searches.
Topsy even offers an API, otter. You can interact with it using python-otter, a python library,
or directly using the REST API.
Topsy archives...
          > tweets containing a link, or
          > tweets that were retweeted



How much does Topsy archive? The first question would be ‘What does Topsy archive?’

Tweets are recorded when either (a) they are retweeted by someone else or (b) they contain a
link.
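Expressed as code, the rule just described looks roughly like the Python sketch below. This is my reading of the rule, not Topsy's actual implementation; the function name, the `retweet_count` parameter and the URL pattern are all assumptions made for illustration.

```python
import re

# Assumption: a "link" is anything that looks like an http(s) URL.
URL_RE = re.compile(r"https?://\S+")

def topsy_would_archive(text, retweet_count):
    """Apply the archiving rule: retweeted at least once, or contains a link."""
    return retweet_count > 0 or bool(URL_RE.search(text))

print(topsy_would_archive("Plain text, never retweeted", 0))       # False
print(topsy_would_archive("Plain text, retweeted twice", 2))       # True
print(topsy_would_archive("See http://nologos.net/ for more", 0))  # True
```

The consequence is that a plain-text tweet nobody retweeted leaves no trace in the archive at all.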
Interesting caveat: Topsy records tweets when they are published. It means that if you delete
a tweet after Topsy has archived it, the tweet will be removed from twitter, but not from
Topsy.

It’s something to keep in mind: Topsy offers something like a ghost image, a remanent
image.
‘quote’ corpus
2657 different authors in Topsy data
394 were unavailable on twitter:
          > Hidden : 94
          > Suspended : 4
          > Closed : 296

ghosting effect on the ‘quotes’ corpus
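To put the ghosting effect in proportion, the slide figures can be turned into a share of all authors. A minimal Python sketch, using only the numbers shown on the slide (the variable names are mine):

```python
# Figures from the slide: authors present in Topsy but unavailable on twitter.
unavailable = {"hidden": 94, "suspended": 4, "closed": 296}
total_authors = 2657

missing = sum(unavailable.values())
share = missing / total_authors

print(missing)          # 394
print(f"{share:.1%}")   # 14.8%
```

Roughly one author in seven can no longer be reached through twitter itself.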
test sample
174368 tweets, including :
          > 3140 RT
          > 18400 links

Topsy’s coverage ≈ 12.1%

Topsy archives tweets that were retweeted or tweets containing a URL. Topsy archives
tweets *that are links*. This is the value. The textual content is impossible, or at least not
cost-effective, to exploit right now.

To evaluate our results is to evaluate Topsy

It has become pretty hard to get a twitter dataset : twitter recently changed its terms of
service and several previously available datasets have become unavailable.

I got a sample from the streaming API. It contains about 175000 tweets, which amounts to a
few minutes of firehose output.

On our sample, Topsy’s coverage would be about 12%, excluding duplicates
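The coverage estimate boils down to counting, once each, the tweets that are retweets or contain a link. The Python sketch below shows that counting on toy data; the function name and the toy tuples are hypothetical. It also computes the upper bound implied by the slide figures if no tweet were both a retweet and a link, which is why the deduplicated figure is slightly lower.

```python
def coverage(tweets):
    """Share of tweets that are RTs or contain a link, each counted once."""
    archived = sum(1 for is_rt, has_link in tweets if is_rt or has_link)
    return archived / len(tweets)

# Toy data for illustration only: (is_rt, has_link) pairs.
toy = [(True, False), (True, True), (False, True), (False, False)]
print(coverage(toy))  # 0.75

# Slide figures: upper bound if no tweet is both an RT and a link.
SAMPLE_SIZE, RTS, LINKS = 174368, 3140, 18400
upper_bound = (RTS + LINKS) / SAMPLE_SIZE
print(f"{upper_bound:.2%}")  # 12.35%
```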
In closing...
--
Pic: http://www.flickr.com/photos/biodivlibrary/6217534124/
•Resource decay is not uniform
              •Linked content has more value
              •Media coverage plays an important part in
              the phenomenon and in what survives




At least in the case of twitter, decay is not evenly distributed.

Content that survives is content that was interesting to someone, content that is linked. The
value is the link, not the content.

In our case, it’s worth considering that media coverage both unearthed some content (the FB
screenshot) and buried other pieces of information (deleted tweets).

Journalists and data investigation techniques are complementary. Data journalism will
probably fuse the two in the long run.
Thank you
                          m.lafrechoux@gmail.com
                              http://nologos.net/
                               @lagayascienza




Thank you for your attention.

Pic: http://www.flickr.com/photos/mmechtley/5379944214/

 
Fb Twitter Presentation Cd April19 [Compatibility Mode]
Fb Twitter Presentation Cd April19 [Compatibility Mode]Fb Twitter Presentation Cd April19 [Compatibility Mode]
Fb Twitter Presentation Cd April19 [Compatibility Mode]
 
COM Ethics Final
COM Ethics FinalCOM Ethics Final
COM Ethics Final
 
Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization
 
Legacy 2.0: the democratization of history and the future of stories
Legacy 2.0: the democratization of history and the future of storiesLegacy 2.0: the democratization of history and the future of stories
Legacy 2.0: the democratization of history and the future of stories
 
Being There
Being ThereBeing There
Being There
 
Twitter guide
Twitter guideTwitter guide
Twitter guide
 
test
testtest
test
 
Complete guide to twitter
Complete guide to twitterComplete guide to twitter
Complete guide to twitter
 
twitter-guide
twitter-guidetwitter-guide
twitter-guide
 
Twitter guide
Twitter guideTwitter guide
Twitter guide
 
Twiiter Information
Twiiter InformationTwiiter Information
Twiiter Information
 
the complete Twitter Guide
the complete Twitter Guide the complete Twitter Guide
the complete Twitter Guide
 
Who, Why & How We Serve: Healthcare Communities, Librarians & Social Media
Who, Why & How We Serve: Healthcare Communities, Librarians & Social MediaWho, Why & How We Serve: Healthcare Communities, Librarians & Social Media
Who, Why & How We Serve: Healthcare Communities, Librarians & Social Media
 
True Stories of Openness (Yavapai College)
True Stories of Openness (Yavapai College)True Stories of Openness (Yavapai College)
True Stories of Openness (Yavapai College)
 
Twitter in the classroom
Twitter in the classroomTwitter in the classroom
Twitter in the classroom
 
Changes
ChangesChanges
Changes
 
Info2011
Info2011Info2011
Info2011
 

Plus de martin255

Kontrast@TOTh 2012
Kontrast@TOTh 2012Kontrast@TOTh 2012
Kontrast@TOTh 2012
martin255
 
Kontrast@TKE 2012
Kontrast@TKE 2012Kontrast@TKE 2012
Kontrast@TKE 2012
martin255
 
Classificateur d'URL
Classificateur d'URLClassificateur d'URL
Classificateur d'URLmartin255
 
Classificateur d'URL
Classificateur d'URLClassificateur d'URL
Classificateur d'URLmartin255
 
Architecture procédurale
Architecture procéduraleArchitecture procédurale
Architecture procédurale
martin255
 
L'unité documentaire sur le web
L'unité documentaire sur le webL'unité documentaire sur le web
L'unité documentaire sur le web
martin255
 

Plus de martin255 (6)

Kontrast@TOTh 2012
Kontrast@TOTh 2012Kontrast@TOTh 2012
Kontrast@TOTh 2012
 
Kontrast@TKE 2012
Kontrast@TKE 2012Kontrast@TKE 2012
Kontrast@TKE 2012
 
Classificateur d'URL
Classificateur d'URLClassificateur d'URL
Classificateur d'URL
 
Classificateur d'URL
Classificateur d'URLClassificateur d'URL
Classificateur d'URL
 
Architecture procédurale
Architecture procéduraleArchitecture procédurale
Architecture procédurale
 
L'unité documentaire sur le web
L'unité documentaire sur le webL'unité documentaire sur le web
L'unité documentaire sur le web
 

Digital Paleontology - Digging for Ancient Tweets

  • 1. Digital Paleontology Digging for Ancient Tweets Martin Lafréchoux JITSO 2012 – EPFL, December 4th, 2012 Full text of the Research notes is available on the JITSO website and on my research blog -- Pic : www.flickr.com/photos/mag3737/307016400/
  • 2. Hi. My name is Martin Lafréchoux. I am a PhD student at Paris Ouest Nanterre. My dissertation deals with the web page as a document. I'll try not to repeat the content of the research notes published on the website. I'd rather address some points that were left out due to space constraints or because they were not ready for publication. Mostly I’ll talk about the various twitter APIs and their implications for the researcher. But first I'd like to try and explain the title of the presentation. Digital Paleontology: why?
  • 3. Twitter is live. You watch twitter as you would TV, to see what's happening right now. When a piece of information spreads on twitter, it creates a cascade. Cascades are live data, best observed in the wild; that's the whole point of just-in-time sociology. The cascade goes on for a short while, then it disappears. Twitter offers several ways to observe these cascades. We will go through them briefly. It’s a bit hard to represent something as vivid in a slide, so I hope you can pardon me for using shaky visual metaphors.
  • 4. Basic use: visiting twitter.com The simplest way to observe information cascades is in the wild, using the twitter web client. You’re in the middle of the action, but that’s not always the best place to see the big picture. If that’s not precise enough, twitter offers several APIs with distinct characteristics. -- Pic : http://www.public-domain-image.com/
  • 5. Real Time: the Streaming API The most complete API is called the streaming API. Complete access to all posted tweets is called the firehose. -- Pic : http://www.flickr.com/photos/usnavy/5887790560/
  • 6. Access is pretty limited, both expensive and exclusive. Gaining access to the firehose is one thing. Then you have to have the stomach to drink from it.
  • 7. October 2012 average: 500 000 000 / day A complete recording for a month of twitter in January 2011 is about 1 billion tweets (Myers & Leskovec, 2012); that’s about 400 unique tweets per second on average, not counting RTs. That’s for an average day, almost two years ago. In October, twitter's CEO reported 500 million tweets per day. http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/ That’s about 5800 per second. -- Pic: http://www.flickr.com/photos/chrish_99/7431798496/
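The per-second figures quoted on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes a 31-day month for January 2011 and treats both totals as round numbers, so the results are approximations.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# January 2011: about 1 billion tweets over a 31-day month (Myers & Leskovec, 2012).
jan_2011_rate = 1_000_000_000 / (31 * SECONDS_PER_DAY)

# October 2012: 500 million tweets per day, as reported by twitter's CEO.
oct_2012_rate = 500_000_000 / SECONDS_PER_DAY

print(f"January 2011: ~{jan_2011_rate:.0f} tweets/s")
print(f"October 2012: ~{oct_2012_rate:.0f} tweets/s")
```

Both values land close to the rounded figures given in the talk (about 400 and about 5800 per second).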
  • 8. There is not much you can do with such an astonishing stream of data besides displaying it, as twitter does, or storing it to analyze it later. Even that requires huge resources. -- Pic: http://www.flickr.com/photos/thomashawk/5683179189/
  • 9. Digital Taxidermy? Once recorded, the live data is frozen. The cascades are turned into something like a stuffed animal. Digital taxidermy, if you will. -- Damien Hirst, The Physical Impossibility of Death in the Mind of Someone Living Photo: http://www.flickr.com/photos/chaostrophy/2594401926/
  • 10. Just-in-time: The Search API The next best option is to use the Search API to analyze tweets in real-time, as they are published. After real-time recording, you get just-in-time. For six to nine days, you can use the twitter search API. The search API would be like an underwater tunnel, giving you easy access to the closest data. -- Pic: http://www.flickr.com/photos/lorensztajer/4201751064/
  • 11. A bit too late : the REST API But what happens after that, after real-time? When you are just-too-late? After a week or so, you only have the REST API available. It means you cannot search, and you only get 150 API calls an hour. You can view the details of a tweet or a user if you already know where to look. As the twitter dev docs put it, you need to get «creative» to get what you want with such limitations. -- Pic: http://www.flickr.com/photos/stinkenroboter/6604532503/
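To get a feel for what 150 calls an hour means in practice, here is a rough budgeting sketch. It assumes each lookup (one tweet or one user) costs exactly one API call, which is a simplification; `hours_needed` is a hypothetical helper, not part of any twitter library.

```python
import math

CALLS_PER_HOUR = 150  # REST API limit quoted in the talk


def hours_needed(n_lookups: int, calls_per_hour: int = CALLS_PER_HOUR) -> int:
    """Whole hours needed if each lookup costs one API call (assumption)."""
    return math.ceil(n_lookups / calls_per_hour)


# Checking the 2657 authors of the 'quote' corpus one by one:
print(hours_needed(2657))
```

Under that assumption, verifying every author of the 'quote' corpus takes the better part of a day, which is why knowing where to look beforehand matters so much.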
  • 12. Much too late... Digital Paleontology You don’t have access to the real, live cascade anymore, but you can still see its shadow, its ghost. And if you spend enough time putting the scattered pieces back together, it can give you a good idea of what the cascade was. So what happens when you are too late for just-in-time amounts, to me, to digital paleontology. There are bits and pieces of data scattered all over the web, if you only know where to dig. For my PhD, I am trying to work a bit of digital paleontology to reconstruct an information cascade that happened more than a year and a half ago, on May 2, 2011. As you may have guessed, I originally tried to map the cascade about 15 days after the fact using the twitter API, and came back sorely disappointed. I tried again six months later, and this presentation is the result of my work. -- Pic: http://www.flickr.com/photos/14508691@N08/4531324072/
  • 13. May 01, 2011 – 03:58PM ET On twitter, it began like this. A Pakistani can’t sleep because of a helicopter. (Timestamps are crucial. I have converted everything to Eastern Time for clarity.)
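Since every source reports times in its own zone, normalizing timestamps is the first step of any reconstruction. The sketch below converts a UTC timestamp to the Eastern Time format used on these slides. It assumes a fixed UTC-4 offset (Eastern Daylight Time, which applied in May 2011); for dates spread across the year, the standard `zoneinfo` module with `America/New_York` would be the safer choice. The input values are illustrative, derived from the slide times rather than taken from the original data.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Eastern Daylight Time (UTC-4), valid for May 2011.
EDT = timezone(timedelta(hours=-4), name="EDT")


def to_eastern(utc_iso: str) -> str:
    """Convert a naive ISO 8601 UTC timestamp to the slide-style ET label."""
    utc_time = datetime.fromisoformat(utc_iso).replace(tzinfo=timezone.utc)
    local = utc_time.astimezone(EDT)
    return f"{local:%b %d, %Y} – {local:%I:%M%p} ET"


print(to_eastern("2011-05-01T19:58:00"))
```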
  • 14. May 01, 2011 – 10:24PM ET Six and a half hours later, wrestler and entertainer Dwayne ‘The Rock’ Johnson posts a rather cryptic tweet.
  • 15. May 01, 2011 – 10:24PM ET At the exact same minute, a former Rumsfeld Chief of Staff posts a more explicit one.
  • 16. People begin gathering in front of the White House, waiting for the press conference. -- Pic: http://www.flickr.com/photos/theqspeaks/5679548043/
  • 17. May 01, 2011 – 11:35PM ET An hour later, Barack Obama appeared on TV to announce the news. Responses were mixed. -- Pic: http://www.flickr.com/photos/us_embassy_newzealand/5682145416/
  • 18. Some Americans were very vocal in their enthusiasm. Pic: http://www.flickr.com/photos/zokuga/5678699597/
  • 19. “I have never wished a man dead, but I have read some obituaries with great pleasure.” Mark Twain Somewhat less enthusiastic Americans expressed their feelings by tweeting this quote by Mark Twain. ... which is actually by civil rights lawyer Clarence Darrow. -- Pic : http://en.wikipedia.org/wiki/File:Mark_Twain,_Brady-Handy_photo_portrait,_Feb_7,_1871,_cropped.jpg
  • 20. “I have never wished a man dead, but I have read some obituaries with great pleasure.” Clarence Darrow Somewhat less enthusiastic Americans expressed their feelings by tweeting this quote by Mark Twain. ... which is actually by civil rights lawyer Clarence Darrow. -- Pic : http://en.wikipedia.org/wiki/File:Mark_Twain,_Brady-Handy_photo_portrait,_Feb_7,_1871,_cropped.jpg
  • 21. It seems that Mark Twain is to the US as Winston Churchill is to the UK or Jules Renard to France: funny quotes are attributed to him by default. The mistake probably comes down to this page. You can imagine how this would look as a Google snippet. http://www.estatevaults.com/lm/archives/2006/08/16/twain_and_darro.html
  • 22. “I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.” Martin Luther King, Jr. Some Americans were appalled that bin Laden was killed rather than taken into custody. Many chose to tweet this quote of Martin Luther King to express their feelings. As you may have guessed, Martin Luther King never said or wrote this sentence. And this is what I will be talking about today, after a rather lengthy introduction. Now, how did this happen? How did this misattributed quote go viral? Now that’s a job for a digital paleontologist. -- Pic: http://en.wikipedia.org/wiki/File:Martin_Luther_King_Jr_NYWTS.jpg
  • 23. “I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.” Some Americans were appalled that bin Laden was killed rather than taken into custody. Many chose to tweet this quote of Martin Luther King to express their feelings. As you may have guessed, Martin Luther King never said or wrote this sentence. And this is what I will be talking about today, after a rather lengthy introduction. Now, how did this happen? How did this misattributed quote go viral? Now that’s a job for a digital paleontologist. -- Pic: http://en.wikipedia.org/wiki/File:Martin_Luther_King_Jr_NYWTS.jpg
  • 24. “I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy.” ? Some Americans were appalled that bin Laden was killed rather than taken into custody. Many chose to tweet this quote of Martin Luther King to express their feelings. As you may have guessed, Martin Luther King never said or wrote this sentence. And this is what I will be talking about today, after a rather lengthy introduction. Now, how did this happen? How did this misattributed quote go viral? Now that’s a job for a digital paleontologist. -- Pic: http://en.wikipedia.org/wiki/File:Martin_Luther_King_Jr_NYWTS.jpg
  • 25. The cascade We don’t often get to see the starting point of an information epidemic, but in this case it is known. -- Pic: http://www.flickr.com/photos/vilseskogen/3279138165/
  • 26. It all started from this facebook post. Jessica Dovey, who teaches English in Kobe, Japan posts the following message on her Facebook wall.
  • 27. May 02, 2011 – 12:15PM ET It’s interesting because: (a) we don’t always get access to the starting point of a cascade; (b) we don’t often get to see Facebook at all. I was prepared to say that we have no idea what happened exactly, but I’ll make a guess.
  • 28. All that we know is that at some point, someone stripped the quote of anything that was actually written by MLK and presented him as the author. Here is what (to the best of my knowledge) happened.
  • 30. 02:22PM ET A user posted the whole quote to twitlonger, mistakenly attributing the whole of it to MLK. Quite a few tweets point to this page (it says 80, Topsy recorded a dozen or so)
  • 31. Many of these tweets have been deleted, probably in shame when their authors realized they had posted a fake quote.
  • 32. And then someone posted the first part of the misattributed quote on twitter. We only have a faint trace it left on Topsy, a twitter archiving service.
  • 33. 02:42PM ET Fortunately for us, some twitter archives exist. Here is the earliest recorded tweet. (These tweets are called the ‘quote corpus’ in my research notes.)
  • 34. 02:52PM ET Ten minutes later, a properly formatted quote reaches a so-called ‘Influential account’
  • 35. 03:15PM ET 25 minutes later, the quote is reposted by Penn Jillette, a pretty famous US magician. At that point, several things happen. (a) The cascade accelerates exponentially (b) Some people, mostly journalists, begin to doubt the authenticity of the quote.
  • 36. May 2, 6:23PM Megan McArdle, then an editor at The Atlantic, publishes a blog post at the end of the afternoon where she expresses her doubts as to the authenticity of the quote. -- Newspaper icon: http://thenounproject.com/noun/newspaper/#icon-No1233
  • 37. It’s thanks to the work of Megan McArdle at the Atlantic that we have Dovey’s screencap. She obtained it via conventional journalism techniques, such as e-mailing a human being. That’s significant because we would never have gained access to a private FB post via an API. Journalists have their own investigation techniques. They are a tad low-tech but this is changing, thanks to the ‘data-journalism’ trend currently going on. (This comes with its own set of issues which I won’t discuss here). She followed up with a second article where she identifies Jessica Dovey.
  • 38. Penn Jillette retracts his previous tweet as soon as he realizes his mistake. This tweet does not really attract the same kind of attention as the quote.
  • 39. May 3, 03:20PM ET The interesting thing is that media coverage modified - almost bent - the cascade as it was still unfolding. Journalists were there at the right time, when it was still happening. But they did not know how to use an API. This Salon.com article mistakenly believes that PJ was the first one to post the tweet on twitter.
  • 40. Why did Penn Jillette create a fake Martin Luther King Jr. quote yesterday? - Twitter - Salon.com May 3, 03:20PM ET The interesting thing is that media coverage modified - almost bent - the cascade as it was still unfolding. Journalists were there at the right time, when it was still happening. But they did not know how to use an API. This Salon.com article mistakenly believes that PJ was the first one to post the tweet on twitter.
  • 42. Salon then posted a follow up piece, which was actually deleted at some later point - I don’t know when, I just realized it while working on these very slides. A few days ago I could still access it using Google Cache, which is another form of ‘ghosting’.
  • 43. And that was the cache last night. Penn Jillette did not delete his tweets, which is a rather classy move on his part.
  • 44. Then the big players and blogs take it from there: Kottke, WaPo, CNN, AllThingsD, the New Yorker (with timestamp). And Jason Kottke, who actually wrote what I consider to be the best write-up.
  • 46. May 2, 06:26PM ET From there, the cascade branches again, creating a new twitter track. Here is the first tweet identifying the quote as fake. It was posted a few minutes after Megan McArdle’s first blog post but does not contain a link.
  • 47. Most tweets calling out the MLK quote as fake contain a link to some type of blog post or news article. My hypothesis is that people are more likely to include a link to some outside source when their tweet goes 'against the flow'. It gives more weight to their tweets. Dissenting voices have a hard time being heard on twitter. [These tweets are called the ‘links corpus’ in my research notes.]
  • 48. As you can see, the peak of the links corpus (yellow) comes about 24h after the ‘quotes corpus’ peak.
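A lag between two peaks like this one can be measured by binning each corpus per hour and comparing the busiest bins. The sketch below shows one way to do that; `peak_hour` and `peak_lag_hours` are hypothetical helpers, and the timestamps fed to them would come from whatever archive export is available, not from any specific API.

```python
from collections import Counter
from datetime import datetime


def peak_hour(timestamps: list[datetime]) -> datetime:
    """Return the start of the hour that contains the most tweets."""
    per_hour = Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in timestamps
    )
    return per_hour.most_common(1)[0][0]


def peak_lag_hours(quotes: list[datetime], links: list[datetime]) -> float:
    """Hours between the busiest hour of each corpus (links minus quotes)."""
    return (peak_hour(links) - peak_hour(quotes)).total_seconds() / 3600
```

Hourly bins are a coarse choice; a finer or smoothed binning might shift the peaks slightly, so the resulting lag should be read as approximate.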
  • 49. Tools & Evaluation A quick review of the tools used for this presentation. -- Pic: http://www.flickr.com/photos/anomieus/6205869109/
  • 50. [journalists] If I may go back to the wildlife metaphor for a second, journalists are a bit like a crew filming a wildlife documentary. Journalists were instrumental in this whole affair. Their analog methods worked and got results that would have been impossible to obtain using an API (most notably on Facebook). -- Pic: http://www.public-domain-image.com/public-domain-images-pictures-free-stock-photos/people-public-domain-images-pictures/people-filming-in-wild-nature.jpg
  • 51. Some of them used a bit of google-fu to identify the quote as fake. With a date range filter, you can see whether a quote just appeared on the web or not. http://www.pcworld.com/article/226912/google_daterange_filter.html But their lack of knowledge of the inner workings of twitter showed. Salon blamed PJ, and no one was able to identify the first twitter posts, less than 48 hours after the beginning of the cascade, when the twitter Search API was still available. Salon, for example, does not reveal how they found their results, but obviously the methods were suboptimal (my guess is that they did not realize that twitter displays so-called ‘Top results’ as default search results). That's changing as we speak. Data journalism is probably going to be featured in a number of 2012 buzzword lists.
  • 52. [data journalists] [This is probably what data journalists look like.] -- Pic: http://www.public-domain-image.com/
  • 53. As for my own tools, the results shown today rely mostly on a twitter archiving service called Topsy. When twitter announced its new terms of service this summer, almost all members of the twitter ecosystem cried in terror. Not Topsy. Topsy is exactly the kind of service twitter wants as a partner. This is the kind of ecosystem twitter is trying to build.
  • 54. There are various twitter archiving services. I chose Topsy because it is the most fully featured free twitter archive that I could find. The crucial point was that it allows a date range filter on searches.
  • 55. Topsy even offers an API, otter. You can interact with it using python-otter, a python library, or directly using the REST API.
  • 56. Topsy archives... > tweets containing a link, or > tweets that were retweeted. How much does Topsy archive? The first question would be ‘What does Topsy archive?’ Tweets are recorded when either (a) they are retweeted by someone else or (b) they contain a link.
  • 57. Interesting caveat: Topsy records tweets when they are published. It means that if you delete a tweet after Topsy has archived it, the tweet will be removed from twitter, but not from Topsy. It’s something to keep in mind: Topsy offers something like a ghost image, a remanent image.
  • 59. Ghosting effect on the ‘quote’ corpus: 2657 different authors in Topsy data; 394 were unavailable on twitter (Hidden: 94, Suspended: 4, Closed: 296).
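The size of this ghosting effect is easier to judge as a share of the corpus. A quick check, using only the counts from the slide:

```python
# Author accounts in the 'quote' corpus that were no longer reachable on twitter.
unavailable = {"hidden": 94, "suspended": 4, "closed": 296}
total_authors = 2657

missing = sum(unavailable.values())
share = missing / total_authors

print(f"{missing} of {total_authors} authors unavailable ({share:.1%})")
```

Roughly one author in seven had become unavailable on twitter while remaining visible in the Topsy data.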
  • 60. Test sample: 174368 tweets, including 3140 RT and 18400 links. Topsy’s coverage ≈ 12.1%. Topsy archives tweets that were retweeted or tweets containing a URL. Topsy archives tweets *that are links*. This is the value. The textual content is impossible / not cost-effective to exploit right now. To evaluate our results is to evaluate Topsy. It has become pretty hard to get a twitter dataset: twitter recently changed its terms of service and several previously available datasets have become unavailable. I got a sample from the streaming API. It contains about 175000 tweets, which amounts to a few minutes of firehose output. On our sample, Topsy’s coverage would be about 12%, excluding duplicates.
  • 62. • Resource decay is not uniform • Linked content has more value • Media coverage plays an important part in the phenomenon and in what survives. At least in the case of twitter, decay is not evenly distributed. Content that survives is content that was interesting to someone, content that is linked. The value is the link, not the content. In our case, it’s worth considering that media coverage both unearthed some content (the FB screenshot) and buried other pieces of information (deleted tweets). Journalists and data investigation techniques are complementary. Data journalism will probably fuse the two in the long run.
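The coverage estimate follows directly from Topsy's archiving rule. The sketch below encodes that rule as a predicate and computes an upper bound from the sample counts; `topsy_would_archive` is a hypothetical name for illustration and is not part of Topsy's API. Adding the two counts double-counts tweets that are both retweeted and contain a link, so the result sits slightly above the deduplicated figure given on the slide.

```python
def topsy_would_archive(has_link: bool, was_retweeted: bool) -> bool:
    """Archiving rule as described in the talk: a link or a retweet is enough."""
    return has_link or was_retweeted


sample_size = 174_368
retweets = 3_140
with_links = 18_400

# Upper bound: tweets matching both criteria are counted twice here.
upper_bound = (retweets + with_links) / sample_size

print(f"Coverage upper bound: {upper_bound:.1%}")
```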
  • 63. Thank you m.lafrechoux@gmail.com http://nologos.net/ @lagayascienza Thank you for your attention. Pic: http://www.flickr.com/photos/mmechtley/5379944214/