I. Sn@tch would allow for real-time capture of tweets along with proactive archiving of embedded resources like images and videos.
II. Rapid analysis capabilities would enable identifying real-time opportunities for collecting important information.
III. Sn@tch would facilitate collection-agnostic linking between different datasets like news archives and social media, even when they use different terminology to describe the same events.
1. Sn@tch: An Archiving and Analysis Service for Global News
Todd Grappone @liber8er
Sharon Farb @farbthink
Martin Klein @mart1nkle1n
Peter Broadwell @peterbroadwell
3. International Collecting
• 829 digitally recorded Iranian dissident news programs
• 9,166 other videos from the Iranian Green Movement
• 29,441 digital photographs from the Green Movement
• 543 documents from Tahrir Square
4. News and Perspectives
The UCLA NewsScape:
• >228,000 hours of TV news
• Recorded 2005-present
• 13 countries, 9 languages
• 38 networks
• Searchable by captions, on-screen text, named entities
• How to incorporate social media into this variety of perspectives?
6. A Brief History of Timeliness
• Twitter archive at the Library of Congress [1]
• Last public update from January 4th 2013
• ~170 billion tweets, > 130 TB compressed (late 2012)
• Single search against 2006-2010 data may take up to 24 hours
• Twitter data access at Massachusetts Institute of Technology, Laboratory for Social Machines [2]
• Public announcement from October 1st 2014
[1] http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/
[2] https://blog.twitter.com/2014/investing-in-mit-s-new-laboratory-for-social-machines
7. A Brief History of Timeliness
In case you missed it:
• Twitter makes full archive of tweets available, indexed
• Great, problem solved?
• How about deleted tweets?
• Real-time capture of embedded resources?
https://blog.twitter.com/2014/building-a-complete-tweet-index
8. A Brief History of Timeliness
• Many initiatives to capture Twitter data
• Live, after an event, both
• Mostly ad-hoc efforts, rarely institutionalized
• Operation often requires programming or sys admin skills
• Deen Freelon’s (American University) incomplete list of tools:
https://docs.google.com/document/d/1UaERzROI986HqcwrBDLaqGG8X_lYwctj6ek6ryqDOiQ/
9. A Brief History of Timeliness
Social Feed Manager (Dan Chudnov, GWU); as presented at #cni13f
http://social-feed-manager.readthedocs.org/
10. A Brief History of Timeliness
twarc (Ed Summers, MITH); used for Ferguson data
http://inkdroid.org/journal/2014/08/30/a-ferguson-twitter-archive/
http://files.archivists.org/conference/nola2013/twitter/twarc-saa13.htm
11. We Can Remember It for You Wholesale
I. Real-time capture of tweets plus pro-active archiving of embedded resources
II. Rapid analysis, real-time opportunities
III. Collection-agnostic linking
12. Remembrance of Tweets/Links Past
• Utilize GWU’s Social Feed Manager
• Filter by keywords, user handles, location, time, etc
• Store raw tweets
• Extract and archive embedded URIs
• Utilize pro-active archiving solutions: Internet Archive, archive.today
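The capture steps on this slide can be sketched roughly as follows. This is a minimal illustration, not Social Feed Manager's own code: the function names are hypothetical, the tweet layout assumes Twitter's v1.1 JSON (`text`, `user.screen_name`, `entities.urls`, `entities.media`), and the archiving step only builds a request URL for the Internet Archive's Save Page Now address rather than sending it.

```python
SAVE_PAGE_NOW = "https://web.archive.org/save/"  # Internet Archive Save Page Now prefix


def matches(tweet, keywords=(), handles=()):
    """Return True if the tweet text contains a keyword or was sent by a listed handle."""
    text = tweet.get("text", "").lower()
    user = tweet.get("user", {}).get("screen_name", "").lower()
    wanted_handles = {h.lower() for h in handles}
    return any(k.lower() in text for k in keywords) or user in wanted_handles


def embedded_uris(tweet):
    """Collect the expanded URLs of links and media embedded in one raw tweet."""
    entities = tweet.get("entities", {})
    uris = []
    for kind in ("urls", "media"):
        for item in entities.get(kind, []):
            uri = item.get("expanded_url") or item.get("url")
            if uri and uri not in uris:  # keep order, drop duplicates
                uris.append(uri)
    return uris


def save_request_url(uri):
    """Build (but do not send) a pro-active archiving request for one URI."""
    return SAVE_PAGE_NOW + uri
```

Storing the raw tweet unchanged and running the extraction as a separate step keeps the archive independent of whichever fields a later analysis happens to need.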
13. Remembrance of Tweets/Links Past
• UCLA’s dataset about Egyptian revolution
• More than 400k tweets
• Approx. 50k unique users
• Tweets originated from within 200 miles around Cairo
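How the 200-mile criterion was applied to this dataset is not stated on the slide. One plausible way to apply such a filter after capture is a great-circle (haversine) distance check, sketched below. The Cairo coordinates, the Earth radius constant and the function names are assumptions made for illustration; the `coordinates` field follows Twitter's v1.1 GeoJSON layout (longitude first).

```python
import math

CAIRO = (30.0444, 31.2357)   # (latitude, longitude) in degrees; assumed city centre
EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def miles_between(a, b):
    """Great-circle distance in miles between two (lat, lon) points, via haversine."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def near_cairo(tweet, radius_miles=200):
    """Return True if the tweet carries a point location within the radius of Cairo."""
    coords = (tweet.get("coordinates") or {}).get("coordinates")
    if not coords:
        return False  # tweets without a point location cannot be placed
    lon, lat = coords  # GeoJSON order is [longitude, latitude]
    return miles_between(CAIRO, (lat, lon)) <= radius_miles
```

Tweets without an explicit point location are dropped here, so a filter like this keeps only the geotagged share of a capture.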
14. Remembrance of Tweets/Links Past
• UCLA’s dataset about Egyptian revolution
• 25% of tweets contain references to external resources (web pages, images, videos, etc.)
20. Remembrance of Tweets/Links Past
URIs from Ed Summers’s Ferguson dataset
https://edsu.github.io/ferguson-urls/
pink == not archived (Internet Archive)
28%
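A check like the one behind this chart could be approximated with the Internet Archive's Wayback Availability API, which reports the closest snapshot for a URL. The sketch below is an assumption about how such a check might be done, not the method used for the Ferguson URL list; the function names are hypothetical, and `lookup` is the only part that touches the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(uri):
    """Build the Wayback Availability API query for one URI."""
    return AVAILABILITY_API + "?" + urlencode({"url": uri})


def lookup(uri):
    """Fetch the availability record for one URI (performs a network request)."""
    with urlopen(availability_url(uri), timeout=30) as resp:
        return json.load(resp)


def is_archived(response):
    """Return True if the API response reports an available snapshot."""
    closest = response.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))


def share_not_archived(responses):
    """Fraction of responses with no available snapshot (0.0 for an empty list)."""
    if not responses:
        return 0.0
    missing = sum(1 for r in responses if not is_archived(r))
    return missing / len(responses)
```

Usage would be along the lines of `share_not_archived([lookup(u) for u in uris])`, ideally with a pause between requests to stay polite to the service.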
27. Raiders of the Lost Links
Challenges and opportunities:
• Legal frameworks for sharing and preserving tweets and linked resources
• Collaborations and partnerships to ensure momentum, sustainability
• Expansion to other forms of (social) media
28. Lazy Digital Archivists: Your Time is Up
Todd Grappone grappone@library.ucla.edu
Sharon Farb farb@library.ucla.edu
Martin Klein martinklein@library.ucla.edu
Peter Broadwell broadwell@library.ucla.edu
Editor's Notes
Todd: Intro, Motivation
In recent years, the Library has become the steward of digital ephemera materials
In an ever-growing variety of formats, especially when social media is included
Many materials at UCLA are from the “Arab Spring” and related movements (2009-?)
This represents both a challenge and an opportunity
Todd: Scholars, students, and the public request that the library host, preserve, and make available these materials
AND ALSO
Provide a suite of services for live capture, analysis, tagging, summarization, and linking of materials
So far, we’ve focused on Twitter as the main form of social media of interest to researchers.
With the understanding that Twitter collections are most useful when analyzed in bulk and linked to other materials
Todd: Providing a Rashomon-like, multiperspective history service
A vital opportunity – and responsibility – of collecting digital news ephemera, especially about recent events, is to collect and present multiple perspectives on the events.
“Official” state and corporate TV media in various countries
Newspapers are interesting too, but diminishing in influence
Independent media, alternative news sources, online sources (incl. blogs) increasingly influential
Social media, from different sources, are also vital and may provide sharply contrasting viewpoints if you can filter the signal from the noise
Personal media (incl. those linked from social media) are another important piece of the puzzle
Martin: State of the art
Twitter "backups" at LoC: collects *everything*, not suitable for us
Dec 2012:
Approx 170 billion tweets, >130TB compressed
Grows by half a billion tweets per day
Single search against 2006-2010 data may take up to 24hrs
Several hundred researcher requests for access, none granted
State of uncertainty
MIT has access to full stream plus archive
Access uncertain, collaboration in infant stages
Martin: State of the art
Twitter announcement of indexed archive of all tweets available
Game changer but does not solve the problem
Martin: State of the art
- realization of multiple ad-hoc initiatives to capture tweets, live, after the event, both
- timing issue, may be too late to capture stuff *check twitter api, how far back can we go?*
- all building silos, no connection, no collaboration
Martin
Martin
Martin: Decision at UCLA
- abstract to higher level and institutionalize as service
- 3 pillars:
1) real-time capture including preservation of embedded resources at capture time (!!!)
2) (real-time) rapid analysis
3) collection-agnostic linking
“Get your ass to Mars!”
Martin: Implementation level #1:
- SFM
- filtering by hashtag, user handle, keyword search, location, time, etc
- extraction of URIs, pro-active archiving of resources, remote for now
Martin: Implementation level #1:
Concrete example of Egypt dataset
Martin: Implementation level #1:
If someone needs another (non-UCLA) example
Martin: pro-actively archived URIs from CNI tweets
Pete: Implementation level #2:
more conceptual at this stage
Emphasizes usability (your average faculty member), flexibility of the service
Focus first on "low-hanging fruit" such as word cloud, histograms, geospatial visualization
Example: Twitter data scientist Simon Rogers’s mapping of reactions to the Ferguson grand jury’s decision on Twitter
Pete: Implementation level #2:
Capture and adapt to evolution of corpus context, occurrence of new hashtags, keywords, etc.
Again, goal is not to reinvent the wheel; rather, encourage collaboration and use of common frameworks
Example: use D3.js, text mining tools available
Live demo; also pre-recorded: using a lightly modified version of GWU’s Social Feed Manager
Feeds live-capture of tweets via Twitter’s API into Node.js and D3 real-time visualizations
Term cloud on #cni14f
Term cloud from reactions to Senate’s “no” vote re: KeystoneXL pipeline
Analysis based on counts of terms, user handles, hashtags, sentiment analysis
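The counting side of this analysis can be sketched in a few lines. The demo itself ran on Node.js and D3; the Python below only illustrates the tallying idea, with a hypothetical function name, an illustrative stop-word list and simple regular expressions. Sentiment analysis is left out.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")
HANDLE = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")
WORD = re.compile(r"[a-z']+")
STOPWORDS = {"the", "and", "for", "that", "this", "with"}  # illustrative only


def tally(texts):
    """Count terms, hashtags and user handles across a batch of tweet texts."""
    terms, hashtags, handles = Counter(), Counter(), Counter()
    for text in texts:
        lowered = URL.sub(" ", text.lower())  # links would otherwise pollute the terms
        hashtags.update(HASHTAG.findall(lowered))
        handles.update(HANDLE.findall(lowered))
        plain = HANDLE.sub(" ", HASHTAG.sub(" ", lowered))
        terms.update(w for w in WORD.findall(plain)
                     if len(w) > 2 and w not in STOPWORDS)
    return terms, hashtags, handles
```

Something like `terms.most_common(50)` would then supply the weights for a term cloud.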
Pete: Implementation level #3:
- More conceptual at this stage
- Background: involves desires we’ve had about both the DEP collection and also NewsScape
- linking as the logical next (desired) step, to enhance the collections and highlight the interconnectedness of multi-perspective news accounts
- linking to:
- (potentially missing) embedded resources
- related content in other collections, including news media
- Example here: mutually distrustful symbiosis of 24-hour “breaking” TV news outlets and Twitter (Boston Marathon bombing)
Pete: Implementation level #3:
Better linking technologies and practices would facilitate cross-collection news analyses like this one:
Comparing the volume of tweets from Cairo about the Tahrir Square protests to US TV news coverage about them (from NewsScape)
We have similar comparisons for the early days of the Libya civil war and the March 11, 2011 earthquake, tsunami, nuclear crisis in Japan
Enables researchers to ask new and more sophisticated questions, get a better sampling of the variety of the recorded reactions to these events
Events to point out: 28 Jan “day of rage”, Internet blackout in Egypt until 3 February, Mubarak’s defiant statement, then resignation; weekends in TV news
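A comparison of this kind comes down to putting both collections on a shared daily timeline. The sketch below assumes tweets carry Twitter's v1.1 `created_at` string and that TV coverage is already summarized as a mapping from ISO date to some daily measure; that mapping, its unit and the function names are hypothetical and do not describe NewsScape's actual interface.

```python
from collections import Counter
from datetime import datetime

TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"  # e.g. "Fri Jan 28 10:00:00 +0000 2011"


def tweets_per_day(tweets):
    """Count tweets per calendar day (in the timestamp's own UTC offset)."""
    days = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME)
        days[created.date().isoformat()] += 1
    return days


def align(tweet_days, tv_days):
    """Merge two per-day series into rows of (day, tweet count, TV measure)."""
    all_days = sorted(set(tweet_days) | set(tv_days))
    return [(day, tweet_days.get(day, 0), tv_days.get(day, 0)) for day in all_days]
```

Days present in only one series show up with a zero in the other, which makes gaps such as an Internet blackout or a quiet TV weekend visible in the merged rows. Note that `%a` and `%b` are locale-dependent, so the parsing assumes an English locale.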
Pete: Implementation level #3:
Another example from the Egyptian revolution, involving potentially missing embedded resources
A tweet linking to a TV news resource that is no longer available and wasn’t formally archived, BUT
Using enhanced search and linking tools, we can find news coverage of this event
and actually many more perspectives on it: a half-dozen different news networks, other Twitter users, other social media? Newspapers?
Sharon: Challenges
- legal issues for sharing collected data, preserving tweets and embedded resources
- building and maintaining momentum for such efforts: the past has shown that ad-hoc doesn't scale, yet interest is growing, and there is no agreed-on model for approaching this
- collaborations and partners: GWU, Stanford, UNT, interested web archives
- expand to other forms of (social) media