I'm going to start this talk with some narrow issues—ones specifically related to research using data from Twitter, the social networking and microblogging site—and expand from them to discuss more general concerns that affect how humanities researchers approach digital resources as part of our research practices.

Humanities researchers are generally used to discussing data that is publicly available to other researchers, either in published works or archives. That's one of the reasons why Twitter seems like a fantastic resource for researchers.
From the outside, Twitter seems like a researcher's dream: millions of posts by actual users that reflect their observations in real time, with piles of interesting metadata, like the location of posts, links to outside resources that readers refer to, and the usernames of the individuals they are conversing with. However, analyzing this data presented some significant problems.

My last major research project focused on how the use of Twitter provides a new model for thinking about the classical rhetorical concerns of memory and delivery. The study focused on three data sets: the tweets of a Congressman during a mini-crisis (of his own making), the tweets sent by attendees of a technology conference panel, and all the tweets referencing the 2009 health care debate over a three-day period.
Despite the sense that Twitter is open in the ways I have discussed, the most pressing problem I faced early on was the difficulty of actually retrieving data from the site. I retrieved the data for these case studies from two sources: Twitter's search engine (search.twitter.com) and the Twitter API (an interface through which application makers give third parties access to their data), both directly and through a third-party Twitter archiving application built on the API.
Each of these methods presented its own challenges. In the case of Twitter search, I soon discovered that the site preserves at most three months' worth of tweets, give or take, for searching. The first case study fell within that window, making it possible to use the search engine to find all of the tweets sent by the subject of the study as well as all of the tweets referencing him.

However, by the time I began to evaluate this data, the original tweets were no longer accessible via search. As a result, the tweets of one prominent commentator, who had tried to interact with the subject of my study but misspelled that subject's username, were not included in my data set, and it was impossible for me to retrieve them.
The Twitter API, the source of the data in the latter two case studies, proved a superior means of access, but it presented its own challenges. Twitter manages access to its data through the API, allowing some developers limited access while giving others complete access to all tweets (what is called the "fire hose"). In the past year, Twitter changed its terms of service to disallow many uses of the API that researchers had relied on to collect Twitter data.
While some researchers have been given access by Twitter to archives, this is certainly not a solution for everyone. Without access to the fire hose, researchers can't be sure that they are receiving all of the messages related to their topic—a username, a hashtag—or how much they are missing. Services like TwapperKeeper, which allowed users to create searches on the site that would then be archived for future access, have had their access limited, eliminating their usefulness as archives of Twitter data.
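To make these constraints concrete, here is a minimal sketch of the kind of collection script researchers built against Twitter's original v1 search API. The endpoint and parameter names below reflect that now-retired API and are shown for illustration only; none of this works against Twitter today. Because the search index reached back only about three months, polling and archiving locally was the only way to preserve a data set.

```python
import json
import urllib.parse

# Historical v1 search endpoint (since retired); shown for illustration.
SEARCH_ENDPOINT = "http://search.twitter.com/search.json"

def build_search_url(query, page=1, rpp=100):
    """Build a paged search URL for a hashtag or username query."""
    params = {"q": query, "page": page, "rpp": rpp}  # rpp = results per page
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)

def archive_results(raw_json, archive):
    """Append tweets from one response to a local archive, de-duplicating
    by tweet id. Repeated polling plus local storage was the researcher's
    only hedge against the ~3-month search window."""
    for tweet in json.loads(raw_json).get("results", []):
        archive.setdefault(tweet["id"], tweet)
    return archive

# Example with a canned response rather than a live request (the live
# endpoint no longer exists):
sample = json.dumps({"results": [
    {"id": 1, "from_user": "example_user", "text": "#hcr example tweet"},
    {"id": 1, "from_user": "example_user", "text": "#hcr example tweet"},
]})
archive = archive_results(sample, {})
print(len(archive))            # duplicate ids collapse to one entry
print(build_search_url("#hcr"))
```

Note that nothing in this workflow tells the researcher what the archive is missing: without the fire hose, a script like this can only record what the search index happens to return.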
Although Twitter's data is public, it isn't very accessible to researchers, and this forces them to create their own archives of Twitter information. For the reasons explained above, it can be nearly impossible for other researchers to recreate data sets, and not all researchers have equal access to Twitter's output.
So, if we zoom out from these issues with Twitter, what do they tell us about the problems associated with digital research in general?

How our society deals with information has changed in some fundamental ways. With much of the data that exists online, the problem is not that the data doesn't exist, but that it is unavailable to individuals outside the infrastructure that holds it, and the cultural institutions that have traditionally guaranteed the availability of information—public and private libraries, collections, and archives—have been slow to update their resources, or simply unable to do so.
Social media and other digital repositories are increasingly replacing the kinds of self-reporting we are familiar with—diaries, personal correspondence, financial and medical records—with digital facsimiles or, occasionally, something entirely new, like TV diaries or foursquare. Yet it is crucial to recognize that with these technologies the primary data is not held by the collector—that is, the person recording their check-ins or noting their media-viewing behavior—but by a for-profit corporation or third party that can 1) do whatever it wants with the data (depending on the details of its EULA), or 2) disappear completely, taking the data with it.
For example, what service provides libraries with subscriptions to social media data the way that Lexis-Nexis provides news data? And while the Library of Congress has acquired the Twitter archive, it isn't yet clear how that archive will be updated as Twitter continues to grow, or how it will be accessed by the public.
We are quickly approaching the day when an author or other person of interest will sell or donate their personal effects to a library or archive and, instead of arranging for the delivery of boxes of papers and other physical artifacts, will simply surrender her or his Google log-in information, and with it a personal search history, a collection of documents, a record of reading and viewing habits, email messages, medical data, and so on.
The situation is somewhat more fraught when one considers third-party applications that piggy-back on the data of larger companies, as is the case with social media services like Twitter. You may have noticed a theme running through these examples of access: the major access points all run through Twitter, which controls both search and its API. Twitter has a right to its data, but it's our data, too. And when for-profit companies hold our archives—our memories—that presents problems of a cultural and historical nature. When our scholarship depends on these same companies, we must rethink not only how we cite material, but how we make that material available to other researchers. That is, we must think about becoming archives ourselves.
Humanities researchers may want to look at how researchers in the sciences conceive of data and make it public. In doing so, we can not only learn from their practices—such as collecting and maintaining our own data—but also build on them by making that data open rather than proprietary. The NEH has encouraged the examination of data and the creation of archives, and libraries and other institutions are beginning to catch up by building digital archives of their own. I don't have many answers for how we can manage this moment, other than to suggest that whatever we build should be open.