Slides from a guest presentation at Aron Lindberg's Computational-Qualitative Field Research seminar: http://aronlindberg.github.io/computational_field_research/ Required readings at https://www.dropbox.com/sh/1gx9s2zlnxvumbz/AAAV9uSAJHsiPeJhSsNnnM9Pa?dl=0
1. Studying Archives of Online Behavior
Computational Qualitative Research Seminar
James Howison
University of Texas at Austin
Link to slides on twitter @jameshowison
Readings at https://www.dropbox.com/sh/1gx9s2zlnxvumbz/AAAV9uSAJHsiPeJhSsNnnM9Pa?dl=0
2. Readings
• The presentation and discussion will draw on:
– Howison, J., & Crowston, K. (2014). Collaboration through open superposition: A theory of the
open source way. MIS Quarterly, 38(1), 29–50.
– Howison, J., Wiggins, A., & Crowston, K. (2011). Validity Issues in the Use of Social Network
Analysis with Digital Trace Data. Journal of the Association for Information Systems, 12(12),
Article 2.
– Geiger, R. S., & Ribes, D. (2011). Trace Ethnography: Following Coordination through
Documentary Practices. In Proceedings of the 44th Hawaii International Conference on System
Sciences (HICSS 2011) (pp. 1–10). Waikoloa, HI. http://doi.org/10.1109/HICSS.2011.455
– Annabi, H., Crowston, K., & Heckman, R. (2008). Depicting What Really Matters: Using
Episodes to Study Latent Phenomenon. In Proceedings of the International Conference on
Information Systems (ICIS).
– The methodological appendix for the Howison and Crowston Superposition article.
3. To the Archives!
The evidence is here, somewhere.
CC Credit: http://www.flickr.com/photos/hamadryades/
4. Opportunities of online archive studies
• Quantity
• Granularity
• Accessibility
– Much is openly available
– Or the organization can provide bulk access
– (compare to ethnography, which requires getting
individual cooperation)
• Emic'ness
5. Emic'ness?
Emic: in their words (from the inside)
Etic: in your words (from the outside)
Naturalistic: the archives are primary to the
users and the activity itself:
"documentary traces are the primary mechanism in
which users themselves know their distributed
communities and act within them."
(Geiger and Ribes, 2011)
6. Yet, many challenges
We are using the system (and the system that
archives and presents the traces) as a data
collection method.
But the systems were not built for research.
So we need to ask, for any research question:
How well do the archives represent the activity,
as it happened?
7. Individual Exercise (6 mins)
1. Pick a system that renders online archives of
something you are interested in.
– Can be your project for this course or something you
choose right now.
– Slight preference for an archive showing traces from
more than one person
2. Go and find a specific archive page and read it.
3. Write a sentence or two about what is
happening there.
8. Quick Group discussion (4 mins)
• Let’s hear from a few participants about their
choices.
9. Individual exercise II (6 mins)
• How might archives diverge from experience?
1. How did the system record activity at the time?
2. How did the conversion to archives occur?
3. How is your experience of reading the archives
different from the experience of the participants
in the activity that was archived?
10. Discussion in groups
• Discuss the questions as a group (go question by
question, not person by person)
– How recorded? (each person speak)
– How converted? …
– How is reading experience different?
11. Most surprising?
• One person from each group reports back the
aspect that was most surprising.
12. Archival transformation
• Deletions
– Some data is periodically purged from databases; after all,
they are running a website, not a research database.
• Overlaps
– When database dumps are pulled periodically
• Re-calculations
– Historical depictions on a site (e.g., counts of messages,
members, or other data such as downloads) might be later
creations or re-calculations
– Can you rely on participants having seen those figures at
the time?
13. Database schemas are not research ontologies
• Databases (or websites) often use words that are
very exciting for research
– “Friends”, “Followers”, “Assignment”, “Member”
• But their meaning may have very, very little to do
with the sociological/theoretical concept
– At best they are a hint that something interesting is
happening, yet they are often interpreted literally!
• Examples from SourceForge
– Use of the "assigned to" field on close.
– The "member list" does not show who is active (no one
was ever removed!)
15. Reasoning with missing/complete data
• Trouble both ways
• Assuming that the data are complete (rather
than a system-selected sample)
– Can miss important activities or whole archives that
need to be integrated.
• Oddly enough, issues can also emerge when
data are complete
– See discussion in the JAIS validity-in-SNA paper.
16. Hidden readership
• Archives almost never tell you who read what,
and when they read it.
– Might be key to interpretation (or might be
irrelevant)
– Definitely crucial to any argument about
information flow (and almost all interpretations of
SNA measures are about information flow).
• You may be able to impute readership from
responses, but it’s a weak signal.
17. Activity traces scattered
through archives
• Participants experience a flow of activities
across different systems
– Linked by time and order that they occur
• But they are archived by different systems
– If you just read the mailing list you miss so much
– And yet so many studies *want* their archive to
be the only one (so much easier to analyze).
18. [Diagram: each task and its outcome draw on relevant documents scattered across multiple archives (Release Notes, Dev Email, Bug Tracker, RFE Tracker, User Forum, CVS); the researcher must search and assign the relevant documents for each task.]
19. Pacing of activities
• Participant observation in an open source
project highlighted the role of pacing.
– Rapid replies indicated interest and importance
but also availability
– Very long gaps (sometimes years) indicated
deferral and return.
• In other work I was reading archives and
found pacing hard to appreciate; it was very
salient in participant observation but hidden
in studies relying on trace data alone.
25. What is to be done?
• Sufficient engagement with the system and community
to adequately interpret the traces.
• Use a system and see how your data are archived.
• When you think a phenomenon/construct can be
operationalized computationally, at least show some
narrative examples from the dataset.
• Complement archives with interviews and/or surveys
– Archives make great prompts for interviews
– Lakhani and Wolf (2003) surveyed participants immediately after a post.
• Gaskin et al. (2014) "Zooming in and out of
sociomaterial routines" MISQ.
26. An ontology for trace data studies
• Document
– Archived content, e.g., an e-mail message, tracker comment, release note, pull request, or
log entry.
– Provides evidence for events and actions.
– One document may provide evidence for multiple events and actions.
• Event
– An event causes documents to be archived, e.g., sending an email or releasing a
version.
• Action
– The contextualized meaning of an event, e.g., contributing code, showing
leadership (can be at quite different conceptual levels in different studies).
• Participant
– An actor (typically a person, but could be a machine or bot)
• Identifier
– A string associated with a participant.
– Many identifiers can refer to one participant (e.g., email address and username)
– but many participants may act through one identifier (e.g., an "admin account")
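As a concrete illustration, here is a minimal sketch of how these classes might be serialized as RDF triples in Turtle. The ex: namespace, the property names, and the instance names are all hypothetical; only the class structure follows the slide.

  @prefix ex: <http://example.org/trace#> .

  # A document (an archived email) provides evidence for an event.
  ex:msg42 a ex:Document ;
      ex:evidences ex:event42 .

  # The event that caused the document to be archived.
  ex:event42 a ex:MailingListEvent ;    # a subclass of ex:CommunicationEvent
      ex:performedBy ex:id_jhowison ;
      ex:interpretedAs ex:action42 .

  # The interpreted action, i.e., the qualitative coding decision.
  ex:action42 a ex:ContributingCode .

  # Two identifiers resolving to one participant.
  ex:id_jhowison a ex:Identifier ; ex:refersTo ex:james .
  ex:id_email a ex:Identifier ; ex:refersTo ex:james .
  ex:james a ex:Participant .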
27. Episodes
• A unit of analysis, facilitating comparison and
summary (e.g., counting)
– Compare to content analysis or NLP that counts
mentions of concepts, database queries that
count documents, or surveys that measure attitudes.
– The detail provided by trace data makes episodes
more accessible, allowing research to be more granular,
closer to the work.
• Ideally emic (meaningful to and recognizable
by participants)
28. Ok, but how to store this?
• Moving from documents and events to actions
and outcomes is interpretative work
– I do the qualitative work first, then hope to make it
computable (e.g., through machine learning)
• It is akin to content analysis but a much more
complicated ontology
– Content analysis (classic or grounded theory) assigns
Codes to Documents
– Software like ATLAS.ti has trouble handling coding of
structured data (dates, linked documents like threads,
multiple identifiers for a single participant).
29. I use RDF
• Resource Description Framework
– Triples: James hasEmail james@howison.name
– URLs working natively (making viewing original archives easy)
• Retains original data structure
– e.g., Document in thread by Identifier
– Allows ad-hoc addition of structure (schemaless)
– Allows inheritance (e.g., MailingListEvent a
CommunicationEvent)
• Allows you to overlay higher level structure
– e.g., Action(s) in (ordered) Episode by Participant
– And then apply codes to Actions (storing when, who, why)
• Querying via SPARQL, validation via SPARQL-based rules (aka SPIN)
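Once actions are overlaid on episodes, comparison and counting become queries. A hedged sketch of such a SPARQL query, reusing the hypothetical ex: names from the Turtle sketch above, counts coded actions per episode and participant:

  PREFIX ex: <http://example.org/trace#>

  SELECT ?episode ?participant (COUNT(?action) AS ?nActions)
  WHERE {
    # Episodes contain the interpreted actions...
    ?episode a ex:Episode ;
             ex:contains ?action .
    # ...and each action resolves through an identifier to a participant.
    ?action ex:performedBy ?identifier .
    ?identifier ex:refersTo ?participant .
  }
  GROUP BY ?episode ?participant
  ORDER BY ?episode

The identifier-to-participant step is what lets the query count one person's work consistently even when they acted under several identifiers.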