Seventh lecture of the course CSS01: Introduction to Computational Social Science at the University of Helsinki, Spring 2015 (http://blogs.helsinki.fi/computationalsocialscience/).
Lecturer: Lauri Eloranta
Questions & Comments: https://twitter.com/laurieloranta
5. • Big promises vs. Big challenges
• The research subjects are humans
• The massive amounts of data are gathered from human based interactions
• This underlines challenges in:
• Research Ethics
• Privacy
• Transparency
• Trust
• Research Method: how to design and conduct research in an ethical
manner?
• Access to data
• Who owns the research data?
• Do you have access to research data?
• Purpose of Research (agenda)
COMPUTATIONAL SOCIAL
SCIENCE IS PROBLEMATIC
6. King, G. 2011. Ensuring the Data-Rich Future of the Social Sciences. Science. 11 February 2011:
Vol. 331 no. 6018 pp. 719-721.
8. 1. Big data changes the definition of knowledge
2. Claims to big data objectivity and accuracy are misleading
3. Bigger data are not always better data
4. Taken out of context big data loses its meaning
5. Accessibility does not make big data research ethical
6. Limited access to big data creates new digital divides
CRITICAL QUESTIONS FOR
BIG DATA (BOYD & CRAWFORD 2012)
9. • “Big Data has emerged a system of knowledge that is already changing
the objects of knowledge, while also having the power to inform how we
understand human networks and community. ‘Change the instruments,
and you will change the entire social theory that goes with them’,
Latour (2009) reminds us.” (Boyd & Crawford 2012)
• “Rather, it is a profound change at the levels of epistemology and
ethics. Big Data reframes key questions about the constitution of
knowledge, the processes of research, how we should engage with
information, and the nature and the categorization of reality.”
• Do numbers really speak for themselves?
• The inherent bias of the tools and technologies!
1. BIG DATA CHANGES THE
DEFINITION OF KNOWLEDGE
(Boyd & Crawford 2012)
10. • “In reality, working with Big Data is still subjective, and what it quantifies
does not necessarily have a closer claim on objective truth”
• There’s a risk that big data widens the division between “subjective”
qualitative research and “objective” quantitative research
• Processing and analyzing big data involves many subjective
steps that are not always recognized as subjective
• How data is cleaned
• What methods of analysis are used and how
• How results are interpreted
• The reliability of data sets?
• Errors in data sets
• Transparency on how the data set is collected is typically very limited!
• Biases and limitations of data set
2. BIG DATA IS NOT THAT
OBJECTIVE
(Boyd & Crawford 2012)
11. • Just because big data presents us with large quantities of data does not
mean that methodological issues are no longer relevant. Understanding
sample, for example, is more important now than ever.
• Validity
• Reliability
• Fit for research question?
• A good example of sample limitations and bias is Twitter data
• Does not represent “all people” even though millions of people might
be included in the data set
• No visibility on the sample selection of the data set
• Size does not equal representativeness
• Restricted access to the Twitter firehose, gardenhose, etc.
3. BIGGER DATA ARE NOT ALWAYS
BETTER DATA
(Boyd & Crawford 2012)
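The point that size does not equal representativeness can be illustrated with a small simulation. This is a toy sketch, not real Twitter data: the population, the opinion rates, and the platform adoption rates are all invented assumptions. It compares a very large sample of "platform users" with a small simple random sample.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: about 30% "young", 70% "old" (invented numbers).
# Young people hold some opinion with probability 0.8, old people with 0.4.
population = []
for _ in range(100_000):
    young = rng.random() < 0.3
    supports = rng.random() < (0.8 if young else 0.4)
    population.append({"young": young, "supports": supports})


def share(sample):
    """Fraction of people in the sample who hold the opinion."""
    return sum(p["supports"] for p in sample) / len(sample)


# "Big data": everyone who uses a hypothetical platform.
# Assumed adoption: 60% of young people, 10% of old people.
platform_users = [
    p for p in population
    if rng.random() < (0.6 if p["young"] else 0.1)
]

# "Small data": a simple random sample of 1,000 people.
random_sample = rng.sample(population, 1_000)

truth = share(population)
big_error = abs(share(platform_users) - truth)
small_error = abs(share(random_sample) - truth)

print(f"true share: {truth:.3f}")
print(f"platform users (n={len(platform_users)}): error {big_error:.3f}")
print(f"random sample (n={len(random_sample)}): error {small_error:.3f}")
```

Under these assumptions the platform sample is many times larger but over-represents one group, so its estimate is further from the true share than that of the small random sample.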
12. • Data related tools and methods might not be transferable from context to
context
• E.g. the Facebook graph might mean something in Facebook, but it is
hardly the full representation of a person’s real-life social network
• Activity and intensity in social media context might not have the same
meaning in real life
• Big data is not generic data about social interactions in general, but
specific to the source it is collected from
4. TAKEN OUT OF CONTEXT, BIG
DATA LOSES ITS MEANING
(Boyd & Crawford 2012)
13. • “[W]hat is the status of so-called ‘public’ data on social media sites? Can it
simply be used, without requesting permission? What constitutes best ethical
practice for researchers? Privacy campaigners already see this as a key battleground
where better privacy protections are needed. The difficulty is that privacy breaches
are hard to make specific – is there damage done at the time? What about 20 years
hence? ‘Any data on human subjects inevitably raise privacy issues, and the real
risks of abuse of such data are difficult to quantify’ (Nature, cited in Berry 2011).”
• Open access to data does not mean that the research is automatically ethical.
• Understanding of the processes of mining and anonymizing Big Data is typically limited:
true accountability requires critical thinking even in cases where an ethical board
has granted access for research
• Significant questions in relation to control and power: researchers have the tools and
the access, while social media users as a whole do not.
5. JUST BECAUSE IT IS ACCESSIBLE
DOES NOT MAKE IT ETHICAL
(Boyd & Crawford 2012)
14. • “But who gets access? For what purposes? In what contexts? And with what
constraints? While the explosion of research using data sets from social
media sources would suggest that access is straightforward, it is anything
but.”
• Only social media companies have full access to the data; an average
scholar does not.
• Access to data typically costs money, which creates uneven opportunities for research
• Top-tier universities are in a better position
• Skills required for accessing data are restricted to those with a computational
background
• This can also be seen as a gendered division
• Limited access creates a huge bias in relation to the questions asked
• Who gets to decide the purposes for which big data is used?
6. LIMITED ACCESS TO BIG DATA
CREATES NEW DIGITAL DIVIDES
(Boyd & Crawford 2012)
15. • Current ethical protocols are not adequate for the types of digital social research
increasingly being conducted.
• Information generated by users of social media platforms and services cannot be considered equivalent to
conventional types of offline information collected by social researchers.
• Challenges according to Neuhaus & Webmoor (2012):
1. Change in the enactment of the participant and researcher relationship
(computer mediated setting where this relationship is mediated)
2. The number of individuals in one research data set has skyrocketed, but so have
the privacy / accountability risks
3. Problems of identity in relation to research “participants” and “research
data”. What roles do these actors actually play?
4. Collected data may reveal users’ identities after remixing with other data
points, even when the original research dataset was anonymized
5. Peer review and accountability might be at stake because nowadays a single
researcher has access to millions and millions of data points previously
accessible only by teams of researchers.
BIG DATA RESEARCH
SETUP CHALLENGES
(Neuhaus & Webmoor 2012)
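Challenge 4 above (re-identification after remixing with other data points) can be sketched as a simple linkage between two data sets. All records, field names, and the choice of quasi-identifiers below are invented for illustration; this is a minimal sketch of the general idea, not a method taken from Neuhaus & Webmoor.

```python
# Hypothetical "anonymized" research data: names removed, but
# quasi-identifiers (postcode, birth year, gender) are kept.
research_data = [
    {"postcode": "00100", "birth_year": 1985, "gender": "F", "topic": "health"},
    {"postcode": "00100", "birth_year": 1990, "gender": "M", "topic": "politics"},
    {"postcode": "00560", "birth_year": 1985, "gender": "F", "topic": "debt"},
]

# Hypothetical public data, e.g. profiles collected from another service.
public_profiles = [
    {"name": "Person A", "postcode": "00100", "birth_year": 1985, "gender": "F"},
    {"name": "Person B", "postcode": "00560", "birth_year": 1985, "gender": "F"},
    {"name": "Person C", "postcode": "00100", "birth_year": 1990, "gender": "M"},
    {"name": "Person D", "postcode": "00100", "birth_year": 1990, "gender": "M"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def key(record):
    """Combination of quasi-identifier values for one record."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def reidentify(anonymized, public):
    """Return {name: topic} for records that match exactly one public profile."""
    names_by_key = {}
    for profile in public:
        names_by_key.setdefault(key(profile), []).append(profile["name"])

    matches = {}
    for record in anonymized:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match singles out one person
            matches[names[0]] = record["topic"]
    return matches


reidentified = reidentify(research_data, public_profiles)
print(reidentified)  # {'Person A': 'health', 'Person B': 'debt'}
```

In this toy data two of the three "anonymous" records are linked to a name, while the third stays ambiguous because two public profiles share the same attribute combination.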
16. • Neuhaus and Webmoor (2012) propose agile ethics for big data research:
• Researchers and institutions should accept the fact that this kind of large-scale data mining
still involves human subjects.
• Logging of research activities and big data collection
• As a contract between researchers and participants is not possible, we need to place data
generation on more of an equal footing with final outputs; to think of it in terms of
authorship.
• Taking responsibility for the data sets
• Agile ethics is more an attitude, or a mode of engagement and sensibility for good practice,
as opposed to a formal list of procedures and protocols
• Flexibility is integral to agile research: considering case by case
• An agile ethics makes the counterintuitive move to increased openness and transparency;
to expose ourselves equally with those wrapped up in our projects.
AGILE ETHICS IN BIG DATA
RESEARCH
(Neuhaus & Webmoor 2012)
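The "logging of research activities and big data collection" point could, for example, take the form of an append-only log with one entry per collection step. The sketch below is only one possible shape: the function name, the field names, and the "example-api" source are hypothetical and do not come from Neuhaus & Webmoor.

```python
import io
import json
from datetime import datetime, timezone


def log_collection(log_file, source, query, n_records, note=""):
    """Append one JSON line describing a single data collection step."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "source": source,        # where the data came from
        "query": query,          # what was asked for
        "n_records": n_records,  # how much was collected
        "note": note,            # free-text reasoning or caveats
    }
    log_file.write(json.dumps(entry) + "\n")
    return entry


# In-memory file for the example; a real project would open a file in append mode.
log = io.StringIO()
log_collection(log, "example-api", "hashtag:election", 1200, "pilot run")
log_collection(log, "example-api", "hashtag:election", 54000, "full run")

# Reading the log back gives a reviewable history of what was collected and when.
entries = [json.loads(line) for line in log.getvalue().splitlines()]
print(len(entries), entries[-1]["n_records"])
```

A plain JSON-lines log is used here because it is easy to share alongside the data set, which fits the emphasis on openness and transparency.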
17. • The power is inherently relational between the following stakeholders:
• Big Data Collectors: decide which data is collected, stored and for how long.
Deciding who gets access.
• Big Data Utilizers: use data and redefine its use. Can be both collector &
utilizer. Determine new behaviour by imposing new social rules or manipulating
social processes.
• Big Data Generators:
• Natural actors that generate massive amounts of new data voluntarily,
involuntarily, knowingly, unknowingly…
• Artificial actors
• Physical phenomena
• In this power network ethical decision making is no longer an agency-based activity but
a relational, network-based one
NEW POWER DISTRIBUTION &
NETWORKED ETHICS
(Zwitter 2014)
18. • “Big data poses big privacy risks. The harvesting of large sets of
personal data and the use of state of the art analytics implicate
growing privacy concerns. Protecting privacy will become harder
as information is multiplied and shared ever more widely among
multiple parties around the world.” (Tene & Polonetsky 2012)
• Big data threatens privacy and democracy
• Incremental Effect: the growing potential of user identification with more
and more data
• Automated decision making based on data and questions of
discrimination and the narrowing of choice
• Predictive analysis based on sensitive individual information
• Lack of access and exclusion: only a few benefit from big data and have
access to it in vast amounts
• Problems with research ethics
• Chilling effects of the surveillance society as people change their
behaviour based on the notion of 24/7 monitoring
BIG CONCERNS ON
PRIVACY
(Tene & Polonetsky 2012)
19. • A key thing to consider in any computational social science study is how to
protect the privacy of individuals and groups that are research subjects
• Research data needs to be anonymized in some way
• Unfortunately this can be quite hard: in data sets with many
data points, the data can be connected to the individual even when
it is anonymous
• Another critical issue is group privacy: although the
individual-level data might be non-personal, the group-level aggregated
data might reveal something “private” about the group
PROTECTING THE PRIVACY OF
THE RESEARCH SUBJECT
(Zwitter 2014)
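One common way to reason about how hard anonymization is (named here as an illustration, not something the slide or Zwitter prescribes) is k-anonymity: a data set is k-anonymous if every combination of identifying attributes is shared by at least k records. The sketch below uses invented records and field names. It measures k before and after coarsening the birth year to a decade.

```python
from collections import Counter

# Hypothetical records with two identifying attributes.
records = [
    {"postcode": "00100", "birth_year": 1985},
    {"postcode": "00100", "birth_year": 1987},
    {"postcode": "00100", "birth_year": 1992},
    {"postcode": "00100", "birth_year": 1994},
    {"postcode": "00560", "birth_year": 1981},
    {"postcode": "00560", "birth_year": 1989},
]


def k_anonymity(rows, columns):
    """Size of the smallest group of rows sharing the same values in `columns`."""
    groups = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(groups.values())


def generalize(row):
    """Coarsen the birth year to a decade, e.g. 1985 -> 1980."""
    return {**row, "birth_decade": row["birth_year"] // 10 * 10}


k_before = k_anonymity(records, ["postcode", "birth_year"])
coarse = [generalize(r) for r in records]
k_after = k_anonymity(coarse, ["postcode", "birth_decade"])

print(k_before, k_after)  # 1 2
```

With exact birth years every record is unique (k = 1); with decades each record shares its attributes with one other (k = 2). Note that this only addresses individual-level identification. It does not solve the group privacy issue, since aggregated data can still reveal something about a group.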
22. 1. Legalities and rights concerning the normal use of software, services
and data: What have the research subjects agreed on?
2. Legalities and rights concerning the research use of software,
services and data: What is allowed for research and what have you
agreed on as a researcher?
3. Legalities and rights concerning the distribution of your own work
(code + data): How can I distribute this so that it benefits
society the most?
THREE LEGAL AREAS TO
UNDERSTAND
23. • Databases and software are typically protected by copyright (or similar
rights), and their use is regulated via database and software licenses.
• Protection for databases varies from country to country. The European Union
has a special database right that protects a database for 15 years.
• For normal copyright the term is the lifetime of the author + 70 years. This
applies to all software.
• In order to use the database or software, a license for the use is needed:
• Agree with the terms of service
• Agree with the license
RIGHTS & LICENSES
24. • EULA: End user license agreement. Typically used in distributed and installed
software and apps. May also include requests for permission to collect and
process end user data. (Wikipedia 2015, End-user license
agreement)
• Terms of service: “The Terms-of-Service Agreement, is mainly used for
legal purposes, by websites and internet service providers, that store a
user's personal data, such as e-commerce and social networking
services. A legitimate terms-of-service agreement, is legally binding, and
may be subject to change.” (Wikipedia 2015, Terms of service)
END USER AGREEMENTS
25. • User rights and responsibilities
• Proper or expected usage; potential misuse
• Accountability for online actions, behavior, and conduct
• Privacy policy outlining the use of personal data
• Payment details such as membership or subscription fees, etc.
• Opt-out policy describing procedure for account termination, if available
• Disclaimer/Limitation of Liability clarifying the site's legal liability for
damages incurred by users
• User notification upon modification of terms, if offered
ITEMS IN A TYPICAL TERMS
OF SERVICE
(Wikipedia 2015, Terms of service)
27. • Terms of service also govern what one is able to do with the service as a
user.
• In many cases a researcher is a user in this respect: thus terms of
service may define what and how one is able to research
• E.g. Is web-scraping allowed?
• E.g. How much information the user is able to get via an API
• As the researcher needs to agree to the terms of service to conduct the
research, there might be legal consequences if the terms are
breached
• It is highly important to read and understand the legal agreements that
relate to one’s research
TERMS OF SERVICE GOVERN
ALSO USE OF DATA (& RESEARCH)
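For the web-scraping question, one technical signal that can be checked programmatically is a site's robots.txt file. It is not the terms of service and has no bearing on what the legal agreements allow, so it complements rather than replaces reading them. The sketch below uses Python's standard urllib.robotparser with invented robots.txt content and a made-up "research-bot" user agent; a real study would read the service's own file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example needs no network access.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a (made-up) crawler may fetch two example URLs.
allowed_public = parser.can_fetch("research-bot", "https://example.org/public/post")
allowed_private = parser.can_fetch("research-bot", "https://example.org/private/profile")

print(allowed_public, allowed_private)  # True False
```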
28. • When using open source software and/or sharing your code, it is important
to understand under which software license this is done
• There are differences between open source licenses, with big
implications (in general, all allow modification, copying and
distribution without license fees).
• Two major types of open source software licenses:
1. Permissive free software licenses
2. Copyleft licenses
• In addition there is the Creative Commons (CC) license family, which is
more general and extends to many other areas than software. Open
Databases are typically licensed under CC, or CC0 public domain.
• The Open Knowledge Foundation is also promoting the Open Database
License (ODbL)
OPEN SOURCE LICENSES
29. • Give rights to use, modify and distribute the software and do not limit the
potential further use of the software.
• Permissive: The further distribution of the software may or may not
be free of charge
• Gives permissions to do anything freely
• Typically requires crediting the original authors
• Can be seen as “the academic” license. The best-known versions are
the MIT and Berkeley (BSD) licenses
• MIT License
• BSD License
PERMISSIVE OPEN
SOURCE LICENSES
30. • Copyleft is the practice of offering people the right to freely distribute
copies and modified versions of a work with the stipulation that the same
rights be preserved in derivative works down the line. (Wikipedia,
Copyleft)
• Software derived from copyleft software must itself be distributed under the
copyleft license. (It can be seen as contagious in this sense)
• The best-known copyleft licenses are the GNU GPL and its versions
COPYLEFT OPEN SOURCE
LICENSES
31. • “Works in the public domain are those whose intellectual property rights
have expired, have been forfeited, or are inapplicable. Examples include
the works of Shakespeare and Beethoven, most of the early silent films, the
formulae of Newtonian physics, Serpent encryption algorithm and powered
flight.” (Wikipedia 2015, Public Domain)
• Getting things to public domain can be quite hard: some countries may even
prohibit any attempt by copyright owners to surrender rights automatically
conferred by law.
• An alternative way: issue a license which irrevocably grants as many rights
as possible to the general public, e.g. the CC0 license from Creative Commons
PUBLIC DOMAIN
(Wikipedia 2015, Public Domain)
32. • “The Data Protection Directive (officially Directive 95/46/EC on the protection
of individuals with regard to the processing of personal data and on the free
movement of such data) is a European Union directive adopted in 1995
which regulates the processing of personal data within the European Union.
It is an important component of EU privacy and human rights law. On 25
January 2012, the European Commission unveiled a draft European
General Data Protection Regulation that will supersede the Data Protection
Directive.” (Wikipedia 2015, Data Protection Directive)
• Governs the processing and transfer of personal data
• Introduced the right to be forgotten
• The U.S. has no single data protection law; legislation is adopted on an ad hoc
basis
EUROPEAN DATA
PROTECTION DIRECTIVE
33. • Read Instagram’s latest Terms of Use, Privacy Policy and API Terms
of Use:
https://instagram.com/about/legal/terms/
• What implications do the terms have in relation to potential
research that uses Instagram pictures as research data?
LECTURE ASSIGNMENT 1
34. • Watch the following videos on big data & privacy:
• https://www.youtube.com/watch?v=H_pqhMO3ZSY
• Read the following articles on ethics, surveillance and big data:
• Zwitter, A. (2014). Big Data ethics. Big Data & Society, 1(2),
2053951714559253.
• Lyon, D. (2014). Surveillance, Snowden, and big data: capacities,
consequences, critique. Big Data & Society, 1(2), 2053951714541861.
LECTURE ASSIGNMENT 2
35. • Boyd, D., & Crawford, K. (2012). Critical questions for big data:
Provocations for a cultural, technological, and scholarly phenomenon.
Information, Communication & Society, 15(5), 662-679.
• Zwitter, A. (2014). Big Data ethics. Big Data & Society, 1(2),
2053951714559253.
• Richards, N. M., & King, J. H. (2014). Big data ethics. Wake Forest Law
Review.
• Neuhaus, F., & Webmoor, T. (2012). Agile ethics for massified research
and visualization. Information, Communication & Society, 15(1), 43-65.
• Lyon, D. (2014). Surveillance, Snowden, and big data: capacities,
consequences, critique. Big Data & Society, 1(2), 2053951714541861.
LECTURE 7 READING
36. • Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations
for a cultural, technological, and scholarly phenomenon. Information,
Communication & Society, 15(5), 662-679.
• Zwitter, A. (2014). Big Data ethics. Big Data & Society, 1(2),
2053951714559253.
• Richards, N. M., & King, J. H. (2014). Big data ethics. Wake Forest Law
Review.
• Bollier, D., & Firestone, C. M. (2010). The promise and peril of big data (p.
56). Washington, DC, USA: Aspen Institute, Communications and Society
Program.
• Tene, O., & Polonetsky, J. (2012). Big data for all: Privacy and user control in
the age of analytics. Nw. J. Tech. & Intell. Prop., 11, xxvii.
• Neuhaus, F., & Webmoor, T. (2012). Agile ethics for massified research and
visualization. Information, Communication & Society, 15(1), 43-65.
• Lyon, D. (2014). Surveillance, Snowden, and big data: capacities,
consequences, critique. Big Data & Society, 1(2), 2053951714541861.
• Hutton, L., & Henderson, T. (2013). An architecture for ethical and privacy-
sensitive social network experiments. ACM SIGMETRICS Performance
Evaluation Review, 40(4), 90-95.
REFERENCES
Great promises, but there are still many problems to be tackled in order to reach full potential
-Research side
-Biased research because of the access
-Society side
The change in research methods may change the understanding of what knowledge is. There is a bias towards the quantitative, where numbers are assumed to tell everything.
Computational social science methods involve many process- and method-related decisions that are quite subjective. Handling numbers does not make them more objective.
-Big amounts of data might also have big amounts of errors
-The collection of data (which might be unknown) is quite subjective (what is collected and what for)
-Data sampling and the reasoning behind your data are no less important even though big amounts of data are available
-Understanding the sample is just as important for validity and reliability
-bias in questions
-Handling research subject privacy may have technical implications for the research design
Hutton & Henderson have proposed a technical architecture (PRISONERS) for handling privacy matters in social network studies