This document discusses tools and methods for collecting Twitter data for research purposes. It outlines Twitter's APIs that can be used to extract tweet data through queries or streams. It also notes legal issues around Twitter's terms of service and API rules. Several desktop and hosted tools for collecting and storing Twitter data are presented. Finally, it discusses different sampling approaches for gathering a representative subset of tweets, such as by hashtag or location filters.
Collecting Twitter Data
1. Collecting Twitter data
Dr. Cornelius Puschmann
School of Library and Information Science
Humboldt-University of Berlin /
Humboldt Institute for Internet and Society
16 April 2013
Royal Statistical Society
2. Overview
1. Examples of research using Twitter data
2. Twitter's data infrastructure
3. Tools for collecting data
4. Sampling issues
3. Examples of research using Twitter data
• Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a Social
Network or a News Media?
Proceedings of the 19th International Conference on the World Wide Web
(WWW ’10) (pp. 591–600). Raleigh, NC.
• González-Bailón, S., Borge-Holthoefer, J., Rivero, A., & Moreno, Y. (2011). The
dynamics of protest recruitment through an online network. Scientific
Reports, 1, 197. doi:10.1038/srep00197
• Ausserhofer, J., & Maireder, A. (2013). National politics on Twitter: Structures
and topics of a networked public sphere. Information, Communication &
Society, 16(3), 291–314. doi:10.1080/1369118X.2012.756050
• Papacharissi, Z., & De Fatima Oliveira, M. (2012). Affective News and
Networked Publics: The Rhythms of News Storytelling on #Egypt. Journal of
Communication, 62(2), 266–282. doi:10.1111/j.1460-2466.2012.01630.x
4. Example questions
Twitter as a platform
• How can Twitter's structure be described?
Social graph
• Who follows whom?
• How does information spread?
Hashtags, keywords, and geography
• How can the discussion of topic X be characterized?
• Who is participating in discussions on X?
• Where are users discussing X?
5. Example questions
URLs in Twitter
• How is mass media content discussed?
• How are academic papers cited on Twitter?
Creative approaches
• Where, when, and with what devices do people
call taxis?
Prediction/application
• Can election results/flu outbreaks/consumption
patterns be reliably predicted?
8. Extracting Twitter data
• HTTP request: "return all data from a given user/hashtag/geolocation/..."
• Application Programming Interface (API)
• Data (usually in a database or spreadsheet)
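The request, API, data flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Twitter's actual interface: the endpoint `api.example.org`, the `count` parameter, and the `statuses` response layout are hypothetical, and the response body is hand-written so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration; not a real Twitter URL.
BASE_URL = "https://api.example.org/search.json"


def build_request_url(query, count=100):
    """Encode the query parameters into a GET request URL."""
    return BASE_URL + "?" + urlencode({"q": query, "count": count})


def parse_response(body):
    """Turn a JSON response body into flat rows for a spreadsheet or database."""
    payload = json.loads(body)
    rows = []
    for status in payload.get("statuses", []):  # assumed response layout
        rows.append({
            "id": status["id_str"],
            "user": status["user"]["screen_name"],
            "text": status["text"],
        })
    return rows


# A hand-written stand-in for a server response, so the sketch runs offline.
SAMPLE_BODY = json.dumps({
    "statuses": [
        {"id_str": "1", "user": {"screen_name": "alice"}, "text": "Hello #rstats"},
        {"id_str": "2", "user": {"screen_name": "bob"}, "text": "Survey data"},
    ]
})

url = build_request_url("#rstats")
rows = parse_response(SAMPLE_BODY)
print(url)
print(rows)
```

In a real collection script the URL would be sent with an authenticated HTTP client and the rows written to a database or CSV file.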
10. Three Twitter APIs
REST API
• traditionally used by most client software
• v1.0 will be phased out in May 2013
• to be replaced by more restrictive v1.1
Streaming API
• public, user, and site streams
• provides data in real time and largely unprocessed as it flows through the platform
Search API
• same functionality as Twitter search
• rate-limited
1) data: tweets, social graph
2) complex tools needed
3) constraints on how much data can be captured
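To make the difference between request-based and stream-based collection concrete, here is a minimal sketch of consuming a stream of messages as they arrive. It assumes one JSON object per line with blank lines as keep-alive signals; the helper names `iter_stream_messages` and `collect` are hypothetical, and an in-memory `io.StringIO` stands in for the network connection.

```python
import io
import json


def iter_stream_messages(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if not line:
            # Assumed convention: blank lines are keep-alive signals.
            continue
        yield json.loads(line)


def collect(stream, limit):
    """Collect at most `limit` messages so the capture stays bounded."""
    collected = []
    for message in iter_stream_messages(stream):
        if len(collected) >= limit:
            break
        collected.append(message)
    return collected


# In-memory stand-in for a live connection, so the sketch runs offline.
fake_stream = io.StringIO(
    '{"id_str": "1", "text": "first"}\n'
    "\n"
    '{"id_str": "2", "text": "second"}\n'
    '{"id_str": "3", "text": "third"}\n'
)
tweets = collect(fake_stream, limit=2)
print([t["text"] for t in tweets])
```

The explicit `limit` reflects the point that there are constraints on how much data can be captured: a stream never ends on its own, so the collector has to decide when to stop.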
11. Legal issues: Twitter's terms of service
"By submitting, posting or displaying Content on or through
the Services, you grant us a worldwide, non-exclusive,
royalty-free license (with the right to sublicense) to use,
copy, reproduce, process, adapt, modify, publish, transmit,
display and distribute such Content in any and all media or
distribution methods (now known or later developed)."
"You agree that this license includes the right for Twitter to
make such Content available to other companies,
organizations or individuals who partner with Twitter for
the syndication, broadcast, distribution or publication of
such Content on other media and services, subject to our
terms and conditions for such Content use."
"We encourage and permit broad re-use of
Content. The Twitter API exists to enable this."
12. Legal issues: API rules
"You will not attempt or encourage others to: sell, rent,
lease, sublicense, redistribute, or syndicate access to the
Twitter API or Twitter Content to any third party without
prior written approval from Twitter. If you provide an API
that returns Twitter data, you may only return IDs (including
tweet IDs and user IDs). You may export or extract
non-programmatic, GUI-driven Twitter Content as a PDF or
spreadsheet by using "save as" or similar functionality.
Exporting Twitter Content to a datastore as a service or
other cloud based service, however, is not permitted."
"Except as permitted through the Services (or these Terms),
you have to use the Twitter API if you want to reproduce,
modify, create derivative works, distribute, sell, transfer,
publicly display, publicly perform, transmit, or otherwise use
the Content or Services."
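One way to read the "you may only return IDs" rule in practice is to strip a collected dataset down to tweet and user IDs before passing it on. The sketch below assumes each tweet record carries `id_str` and a nested `user.id_str`, as in Twitter's JSON tweet objects; the function name `to_shareable_ids` is hypothetical, and whether this satisfies the rules in a given case is an interpretation, not something the quoted text confirms.

```python
def to_shareable_ids(tweets):
    """Reduce full tweet records to tweet and user IDs only."""
    return [
        {"tweet_id": t["id_str"], "user_id": t["user"]["id_str"]}
        for t in tweets
    ]


# Hand-written example records; text and other fields are dropped on export.
collected = [
    {"id_str": "101", "text": "first tweet", "user": {"id_str": "7", "screen_name": "alice"}},
    {"id_str": "102", "text": "second tweet", "user": {"id_str": "8", "screen_name": "bob"}},
]
shareable = to_shareable_ids(collected)
print(shareable)
```

Recipients of such a list would have to retrieve the full tweets themselves through the API.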
17. Sampling approaches
Strategy #1: Sample by hashtag, keyword, user, geographical
location, or other filtering parameters
+ representativeness unclear on multiple levels
- time frame and parameters have to be carefully chosen
Strategy #2: Use the 1% or 10% sample provided by the
Streaming API
+ generally assumed to be representative (of Twitter)
- time frame has to be carefully chosen
Strategy #3: Capture Twitter's entire throughput
+ highly representative (of Twitter)
- technically very difficult/costly
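The first two strategies can be imitated on data that has already been collected. The sketch below is a local analogue only: `filter_by_hashtag` assumes each tweet carries an `entities.hashtags` list with a `text` field, and `random_sample` draws a uniform random subset, which is not necessarily how the Streaming API selects its sample. Both function names and the synthetic tweets are hypothetical.

```python
import random


def filter_by_hashtag(tweets, hashtag):
    """Strategy #1 analogue: keep tweets that carry the given hashtag."""
    wanted = hashtag.lstrip("#").lower()
    return [
        t for t in tweets
        if wanted in {h["text"].lower() for h in t["entities"]["hashtags"]}
    ]


def random_sample(tweets, fraction, seed=0):
    """Strategy #2 analogue: draw a reproducible uniform random subset."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    k = round(len(tweets) * fraction)
    return rng.sample(tweets, k)


# Synthetic data: 200 tweets, every fourth one tagged #Egypt.
tweets = [
    {"id_str": str(i),
     "entities": {"hashtags": [{"text": "Egypt"}] if i % 4 == 0 else []}}
    for i in range(200)
]

by_tag = filter_by_hashtag(tweets, "#egypt")
sample = random_sample(tweets, 0.10)
print(len(by_tag), len(sample))
```

Recording the seed, the time frame, and the filter parameters alongside the data makes it easier to document how a sample was drawn.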
18. Summary
develop a question/general direction
collect data using these or other tools
store in a database or spreadsheet (CSV)
annotate, analyze and visualize using a variety of tools
(Excel, Tableau, R, Gephi, NVIVO, ...)
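The store-then-analyze steps can be sketched with Python's standard library alone. The column names `id`, `user`, and `text` are assumptions chosen for illustration, and an in-memory buffer stands in for a CSV file on disk.

```python
import csv
import io
from collections import Counter

FIELDS = ["id", "user", "text"]  # assumed column layout


def write_csv(rows, handle):
    """Write tweet rows to CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def tweets_per_user(handle):
    """Read the CSV back and count tweets per user."""
    reader = csv.DictReader(handle)
    return Counter(row["user"] for row in reader)


rows = [
    {"id": "1", "user": "alice", "text": "Hello #rstats"},
    {"id": "2", "user": "bob", "text": "Survey data"},
    {"id": "3", "user": "alice", "text": "More #rstats"},
]

# Stands in for open("tweets.csv", "w", newline="") in a real script.
buffer = io.StringIO(newline="")
write_csv(rows, buffer)
buffer.seek(0)
counts = tweets_per_user(buffer)
print(counts.most_common())
```

A CSV file written this way can be opened directly in the spreadsheet and analysis tools listed above.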