This presentation was given at the 2014 AAPOR conference and examines specific ways in which "big data" from social media differs from data acquired through surveys.
AAPOR - comparing found data from social media and made data from surveys
1. "When Are Big Data Methods Trustworthy for Social Measurement?"
Cliff Lampe (@clifflampe), Josh Pasek, Lauren Guggenheim, Fred Conrad
University of Michigan
Michael Schober
The New School for Social Research
2. Presenting on “Big Data”
• Cliff Lampe
– University of Michigan
School of Information
– Social Scientist who uses
some Big Data
techniques
– NOT A REAL DATA
SCIENTIST
– Background in survey
research
4. CHI – Computer Human Interaction
KDD – Knowledge Discovery and Data Mining
WSDM – Web Search and Data Mining
5. Ironically Data-Free Presentation
Today we are presenting on methodological issues of
Big Social Data and surveys. Not presenting new data.
First we describe Big Data and Big Social Data
as terms.
Then we describe methodological
considerations at the intersection of
surveys and Big Social Data
6. There have been many hyperbolic claims about Big Data
Is Big Data going to replace other forms of social
measurement, or is it too flawed to survive? (HINT: Neither)
9. Big Data is increasingly being applied to
social science questions
10. What counts as “big”?
LHC: .001% of sensors
lead to 25 petabytes
annually.
Wikipedia: 17 terabytes
Twitter: ~ 10 GB/day
How many observations
needed to count as
“big”?
Note: 100 million records not all that big.
11. Almost nobody who
uses these techniques
would use the term
“big data”. Similar to
surveys vs. polls.
Big Data is shorthand for a
variety of techniques that
include:
- Data capture
- Data storage
- Data analytics
- Search and Retrieval
12. Challenges in “Big Data”
Capture
Curation
Storage
Search
Sharing
Transfer
Analysis
Visualization
Related terms:
Computational social science, data
science, information access and
retrieval, Web-scale data, data
mining, machine learning,
non-reactive data
13. Big Social Data: large data sets about humans that are collected from
social interactions captured online, primarily in social media sites.
14. What are the characteristics of surveys and Big Social
Data that define when they are complementary,
supplementary, or orthogonal?
15. Bob Groves
“Three Eras of Survey Research”
Mick Couper
“Is the Sky Falling? New Technology, Changing
Media, and the Future of Surveys”
16. Survey Research
80+ years of research and practice
Sampling procedures
Question design
Estimating precision of statistics
Practices in reducing survey error
Attempt to represent the population of interest
with a sample
17. Research Questions
• Do we see big social data and survey data telling
us the same things about society? When and why
might this happen?
• How do survey data and big social data compare
on important dimensions?
• In what ways are the two fundamentally different
from each other?
• How are their uses different from one another?
18. Highlighting 3 Areas of Concern
How participants understand the activity of responding
or posting
Different motivations and communicative dynamics
Nature of the data
Different structure, users, and data
properties
Practical, ethical, and analytic
considerations
20. Participants’ Understanding
– Posting initiative or motivation
– Informed consent
– Ability to opt out
– Prior considerations
– User identity
– Perceived audience and social desirability
– Time pressure/synchrony
– Respondent burden
21. Participants’ Understanding
• Nature of perceived audience
– Survey: Interviewer, organization, others in household (HH)
– BSD (Big Social Data): Groups of friends, acquaintances, public
• Social Desirability
– Survey: Avoid negative evaluations from
researcher
– BSD: Manage impressions for their audience
• Scale of data
• Face threatening topics
22. Participants’ Understanding
• Identity of user
– Survey: Kept anonymous
– BSD: User-created persona. Multiple users on a single
account, multiple accounts for one user, corporate
users, etc.
• Prior Considerations
– Survey: May not have thought about issue
– BSD: Have thought about it, maybe not deeply
• Being asked vs caring to post
24. Nature of the Data
– Population coverage
– Sampled units
– Sampling
– Sample size
– Temporal properties
– Relevance to research topic
– Granularity of possible analyses
– Data structure
– Auxiliary information
25. Nature of the Data
• Sampling
– Surveys: Representative of population of interest (via probability
sampling)
– BSD: Users/messages not the full population. User accounts are
not always users. Frequency of posting among users varies
• Sample Size
– Surveys: Balance between large enough to make inference and
low cost
– BSD: More users and posts than surveys. Limited by
access/storage.
• Can size help overcome sampling/representativeness problems?
• The aggregation of social media content does not necessarily map onto the
collection of individuals in survey research
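The question above, whether size can overcome sampling problems, can be illustrated with a small simulation. This is a hedged sketch, not something from the presentation: the population, the 50% true share, and the posting probabilities (0.6 for people who hold the opinion, 0.3 for those who do not) are all made-up assumptions chosen only to show the mechanism. The function name `simulate` is likewise hypothetical.

```python
import random


def simulate(pop_size=200_000, srs_n=1_000, seed=0):
    """Compare a small probability sample with a large self-selected one.

    All numbers here are illustrative assumptions, not empirical values.
    """
    rng = random.Random(seed)

    # Hypothetical population: each person holds opinion A with probability 0.5.
    population = [rng.random() < 0.5 for _ in range(pop_size)]
    true_share = sum(population) / pop_size

    # Survey-style data: a simple random sample of srs_n people.
    srs = rng.sample(population, srs_n)
    srs_est = sum(srs) / srs_n

    # "Found" data: people self-select into posting, and (by assumption)
    # holders of opinion A are twice as likely to post as everyone else.
    found = [x for x in population if rng.random() < (0.6 if x else 0.3)]
    found_est = sum(found) / len(found)

    return true_share, srs_est, len(found), found_est


if __name__ == "__main__":
    true_share, srs_est, n_found, found_est = simulate()
    print(f"true share of opinion A:         {true_share:.3f}")
    print(f"random sample (n=1000) estimate: {srs_est:.3f}")
    print(f"self-selected (n={n_found}) estimate: {found_est:.3f}")
```

Under these assumptions the self-selected data set is far larger than the random sample yet lands further from the true share, because adding more self-selected posts reduces variance but leaves the selection bias untouched.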
26. Nature of the Data
• Temporal properties:
– Surveys: Memory retrieval, measurement at
discrete moments
– BSD: Posting on recent events, continuously
• Auxiliary data:
– Surveys: Paradata (# calls, behavior during
interview)
– BSD: Geolocation, system activity, profile info
28. Practical, Ethical, and Analytic
Considerations
– Established research communities
– Consent to research/IRB
– Perception of research among public
– Costs to researchers
– Data ownership
– Adjustments for non-representativeness
– Stability of data source and adjustments
– Updating models in changing environment
– Users and impact
29. Practical Considerations
• Adjustments for non-representativeness
– Surveys: Well developed, weighting
– BSD: No standard use, depends on style of analysis,
may not be done if using certain techniques
• Ethical issues
– Surveys: Explicit consent, regulated by government/IRB
– BSD: Users unaware of terms in user agreement,
inconsistently regulated by IRBs
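As a rough illustration of the weighting adjustment listed for surveys on this slide, the sketch below applies simple post-stratification on a single grouping variable. It is a minimal example under stated assumptions: the function name `poststratified_mean`, the age groups, the responses, and the population shares are all hypothetical, and real survey weighting typically involves more variables and more care.

```python
from collections import Counter


def poststratified_mean(groups, outcomes, population_shares):
    """Weighted mean where each respondent's weight is
    (population share of their group) / (sample share of their group)."""
    n = len(groups)
    sample_counts = Counter(groups)
    weights = {
        g: population_shares[g] / (sample_counts[g] / n) for g in sample_counts
    }
    weighted_total = sum(weights[g] * y for g, y in zip(groups, outcomes))
    weight_sum = sum(weights[g] for g in groups)
    return weighted_total / weight_sum


# Hypothetical sample that over-represents younger respondents.
groups = ["young"] * 8 + ["old"] * 2
outcomes = [1] * 6 + [0] * 2 + [1] + [0]      # 1 = "yes", 0 = "no"
population_shares = {"young": 0.4, "old": 0.6}  # assumed known, e.g. from a census

unweighted = sum(outcomes) / len(outcomes)
weighted = poststratified_mean(groups, outcomes, population_shares)
print(f"unweighted: {unweighted:.2f}, post-stratified: {weighted:.2f}")
```

Here the unweighted estimate is about 0.70 and the post-stratified one about 0.60. The adjustment depends on knowing each unit's group and the population shares, which is part of why it has no standard use for Big Social Data, where accounts often lack reliable demographic information.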
30. Practical Considerations
• Perception of research/Legitimacy
– Surveys: fatigue, falling response rates, confusion
about legitimacy
– BSD: not considered while posting, but concerns
over surveillance
32. Conclusion
We need to stop arguing
about the wrong things.
We need a systematic
agenda of research
looking at the
intersection of these
methods.
socialmediasurveys@umich.edu
cacl@umich.edu
Twitter: @clifflampe
Editor's Notes
How participants understand the activity of responding or posting
Different motivations and communicative dynamics
Nature of the data
Different structure, users, and data properties
Practical, ethical, and analytic considerations