New Methodologies for Capturing and Working with Publicly Available Twitter Data
1. New Methodologies for Capturing and Working with Publicly Available Twitter Data
Associate Professor Axel Bruns
@snurb_dot_info
http://mappingonlinepublics.net/
Queensland University of Technology
2. WHY TWITTER?
• Researching Twitter:
– Significant world-wide social network
– ~500 million accounts (but how many active?)
– Varied range of uses: from phatic communication to emergency coordination
– Healthy third-party ecosystem (for now)
– Strong history of user innovation: @replies, #hashtags
– Flat and open network structure: non-reciprocal following, public profiles by default
– Good API for gathering (big) data for research
3. NEW MEDIA AND PUBLIC COMMUNICATION: MAPPING AUSTRALIAN USER-CREATED CONTENT IN ONLINE SOCIAL NETWORKS
• Australian Research Council (ARC) Discovery Project (2010-13) – $410,000
– QUT (Brisbane), Sociomantic Labs (Berlin)
– First comprehensive study of Australian social media use
– Computer-assisted cultural analysis: tracking, mapping, analysing blogs, Twitter, Flickr, YouTube as ‘networked publics’
– Addressing the problem of scale (‘Big Data’) and disciplinary change in media, cultural and communication studies – natively digital methods
– Studying society with the Internet (Richard Rogers)
http://mappingonlinepublics.net/
4. A TWITTER RESEARCH TOOLKIT
• Data Gathering
– yourTwapperkeeper + in-house crawler
• Data Processing
– Gawk – open source, multiplatform, programmable command-line tool for processing CSV documents
• Textual Analysis
– Leximancer – commercial, multiplatform: extracts key concepts from large corpora of text, examines and visualises concept co-occurrence
– WordStat – commercial, PC-only text analysis tool; generates concept co-occurrence data that can be exported for visualisation
• Visualisation
– Gephi – open source, multiplatform network visualisation tool
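One way the processing and visualisation steps could connect is sketched below, in Python rather than Gawk. It counts who mentions whom in a set of tweets and writes a weighted edge list with Source, Target and Weight columns, a layout that Gephi's spreadsheet import can read as an edge table. The function names, the simple @mention pattern and the sample tweets are illustrative assumptions, not part of the project's actual toolkit.

```python
import csv
import io
import re
from collections import Counter

# Assumption: a screen name is any run of word characters after "@".
# This simple pattern would also match e-mail-like strings.
MENTION = re.compile(r"@(\w+)")

def mention_edges(tweets):
    """Count (sender, mentioned user) pairs across (from_user, text) tuples."""
    edges = Counter()
    for sender, text in tweets:
        for target in MENTION.findall(text):
            edges[(sender.lower(), target.lower())] += 1
    return edges

def write_edge_list(edges, handle):
    """Write the edge counts as a CSV edge table (Source, Target, Weight)."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])

# Hypothetical sample data, for illustration only.
sample = [
    ("alice", "@bob have you seen this? #qldfloods"),
    ("bob", "RT @alice: road closed at Milton"),
    ("carol", "@bob @alice thanks both"),
    ("alice", "@bob one more update"),
]

edges = mention_edges(sample)
buffer = io.StringIO()
write_edge_list(edges, buffer)
print(buffer.getvalue())
```

In practice the output would go to a file rather than an in-memory buffer, and retweets and @replies might be kept as separate edge types.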
6. APPROACHING TWITTER
• Possible research questions:
– Hashtags as vehicles for ad hoc events and publics:
• How do online publics form and dissolve? How do they interact, what structures do they form?
• Where do they draw information from? What do they share?
• Do they simply consist of the usual suspects? How insular and disconnected are online publics?
– Hashtags in context:
• How do different hashtag events compare? Are there common types of hashtags/publics?
• How ‘big’ are they? What topics attract attention on Twitter?
• What community (?) structures emerge?
7. DEVELOPING TWITTER METRICS
• Key data points available through the Twitter API:
– text: contents of the tweet itself, in 140 characters or fewer
– to_user_id: numerical ID of the tweet recipient (for @replies)
– from_user: screen name of the tweet sender
– id: numerical ID of the tweet itself
– from_user_id: numerical ID of the tweet sender
– iso_language_code: code (e.g. en, de, fr, ...) of the sender’s default language
– source: client software used to tweet (e.g. Web, Tweetdeck, ...)
– profile_image_url: URL of the tweet sender’s profile picture
– geo_type: format of the sender’s geographical coordinates
– geo_coordinates_0: first element of the geographical coordinates
– geo_coordinates_1: second element of the geographical coordinates
– created_at: tweet timestamp in human-readable format
– time: tweet timestamp as a numerical Unix timestamp
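As a rough illustration of working with these data points, the sketch below reads tweets from a CSV export whose header row uses the field names listed above. That layout is an assumption about how an archive might be exported, not a documented format, and the sample row is invented. The numerical `time` field is converted to a timezone-aware datetime so that tweets can later be grouped by day, hour or minute.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical one-row export; column names follow the data points listed above.
SAMPLE_EXPORT = """text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
"@bob road closed at Milton #qldfloods",22,alice,1001,11,en,web,http://example.org/a.png,,,,"Wed, 12 Jan 2011 00:00:00 +0000",1294790400
"""

def read_tweets(handle):
    """Yield one dict per tweet, with typed 'time' and 'to_user_id' fields."""
    for row in csv.DictReader(handle):
        # Unix timestamp -> timezone-aware UTC datetime.
        row["time"] = datetime.fromtimestamp(int(row["time"]), tz=timezone.utc)
        # to_user_id is only set for @replies; treat an empty value as None.
        row["to_user_id"] = int(row["to_user_id"]) if row["to_user_id"] else None
        yield row

tweets = list(read_tweets(io.StringIO(SAMPLE_EXPORT)))
for tweet in tweets:
    print(tweet["from_user"], tweet["time"].isoformat(), tweet["text"])
```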
8. DEVELOPING TWITTER METRICS
• Additional data points from tweets:
– original tweets: tweets which are neither @reply nor retweet
– retweets: tweets which contain RT @user… (or similar)
• unedited retweets: retweets which start with RT @user…
• edited retweets: retweets which do not start with RT @user…
– genuine @replies: tweets which contain @user, but are not retweets
– URL sharing: tweets which contain URLs
• Potential uses:
– metrics per hashtag
– metrics per timeframe (day, hour, minute, second, …)
– metrics per user (or group of users)
– …
(Bruns & Stieglitz, forthcoming)
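The tweet categories above can be approximated with a few pattern matches. The sketch below is one possible reading, not the authors' implementation: the patterns standing in for "RT @user… (or similar)" (here RT, MT and via) are assumptions, URL sharing is treated as a flag that can accompany any category, and the sample tweets are invented. Counting categories per sender gives a simple per-user metric.

```python
import re
from collections import Counter

# Assumed retweet markers; the source only says "RT @user... (or similar)".
RETWEET = re.compile(r"\b(?:RT|MT|via)\s*@\w+", re.IGNORECASE)
UNEDITED_RETWEET = re.compile(r"^RT\s*@\w+", re.IGNORECASE)
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

def classify(text):
    """Return (tweet type, shares_url) for a single tweet text."""
    text = text.strip()
    if RETWEET.search(text):
        # Unedited retweets start with the retweet marker; edited ones
        # carry added text before it.
        kind = "unedited retweet" if UNEDITED_RETWEET.match(text) else "edited retweet"
    elif MENTION.search(text):
        kind = "genuine @reply"
    else:
        kind = "original"
    return kind, bool(URL.search(text))

# Hypothetical (from_user, text) pairs, for illustration only.
sample = [
    ("alice", "Road closed at Milton http://example.org/map #qldfloods"),
    ("bob", "RT @alice: Road closed at Milton"),
    ("carol", "Stay safe! RT @alice: Road closed at Milton"),
    ("bob", "@alice thanks for the update"),
]

# Metrics per user: how many tweets of each type every sender posted.
per_user = Counter((user, classify(text)[0]) for user, text in sample)
for (user, kind), count in sorted(per_user.items()):
    print(user, kind, count)
```

The same counter could be keyed by hashtag or by a truncated timestamp to obtain metrics per hashtag or per timeframe.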
13. BEYOND HASHTAGS
• Publics on Twitter:
– Micro: @reply and retweet conversations
– Meso: follower/followee networks
– Macro: hashtag ‘communities’ (Bruns & Moe, forthcoming)
Multiple overlapping publics / networks
• What drives their formation and dissipation?
• How do they interact and interweave?
• How are they interleaved with the wider media ecology?
• Twitter doesn’t contain publics: publics transcend Twitter
14. ‘BIG DATA’ AND THE DIGITAL HUMANITIES
• Emerging needs in Twitter research:
– Unified, compatible methods and metrics for Twitter analysis
Tools and approaches shared at http://mappingonlinepublics.net/
– Powerful infrastructure for long-term, high-volume tracking of public communication on Twitter
Data access requires substantial funding stream
– Facilities for long-term data storage and preservation
Key roles for National Libraries, National Archives
– Integration with related datasets (e.g. MSM content)
Need to address data interoperability questions
– Robust frameworks for Internet research ethics
Clear guidelines which take into account complex new public/private structures
• Twitter as a test case for digital humanities research
– Widespread, open, public platform for everyday communication
– Tool for observing society at scale through Internet research