1. OPEN DATA:
ENHANCING
PRESERVATION,
REPRODUCIBILITY, AND
INNOVATION
Clarke Iakovakis
Scholarly Communications Librarian
Neumann Library
CC BY-SA 3.0-2.5-2.0-1.0 image courtesy Daniel Tenerife - Own work.
Title: "Social Red"
https://commons.wikimedia.org/wiki/File:Social_Red.jpg#mediaviewer/File:Social_Red.jpg
This work is licensed under a Creative Commons Attribution-
NonCommercial-ShareAlike 4.0 International License.
2. CHANGE IN MINDSET
“data is no longer regarded as static or stale,
whose usefulness is finished once the purpose for
which it was collected was achieved.”
- Kenneth Cukier and Viktor Mayer-Schönberger
"in some fields, the data are coming to be viewed
as an essential end product of research,
comparable in value to journal articles or
conference papers”
- Christine Borgman
3. OUTLINE
• Data-centric scholarship
• Benefits & challenges of open data
• Defining open data
• Reproducibility
• Public use & data management plans
• Data reuse
• Concerns and open questions
• Where to deposit data?
5. WHAT WE MEAN BY “DATA”
A wide definition:
any information that can be stored in digital form,
including text, numbers, images, video or movies,
audio, software, algorithms, equations,
animations, models, simulations, etc. Such data
may be generated by...observation, computation,
or experiment
- National Science Board
National Science Board. Long-Lived Data Collections: Enabling Research and Education in the 21st Century. Arlington, VA (2005): 13.
https://www.nsf.gov/pubs/2005/nsb0540/nsb0540_3.pdf
6. WHAT IS RESEARCH DATA?
collected, observed, accessed, or created, for the
purposes of analysis to produce and validate original
research results.
What is a routine collection at one point can
become research data in the future
Thus research data are very much about when
they are used, as well as what they constitute, and
the purpose for which they are to be used
University of Edinburgh. “Research Data Explained.” http://mantra.edina.ac.uk/researchdataexplained/
7. WHAT IS RESEARCH DATA?
Hard science: scientific data generated by
instrumented research projects
Social science: data generated from government
statistics, online surveys, behavioral models
Humanities: bodies of text, digital images and
video, models of historic sites
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
8. DATA INTENSIVE RESEARCH
Applying information technology to research problems
Collaborations across disciplines & increasing size of
collaborations
Increasing the complexity and quantity of research
data
9. DATA INTENSIVE RESEARCH
• Scientific instruments generate data at greater
speeds, densities, and detail
• Digitization of older print & analog data
• Born digital data
• Data storage capacity increases & storage
costs decrease, enabling preservation of data
• Improvements in searching, analysis &
visualization tools
10. World’s technological installed capacity to store information (table SA1) (16).
M Hilbert, and P López Science 2011;332:60-65
11. SLOAN DIGITAL
SKY SURVEY
The most distant quasar ever discovered (at least as of
October 2003). The redshift 6.4 quasar is seen at a
time when the universe was just 800 million years old.
The light-travel time from this object to us is about 13
billion years.
http://www.sdss.org
13. VALUE OF DATA
Pryor, Graham. “Why Manage Research Data?” Managing Research Data. London: Facet Publishing, 2012.
14. VALUE OF DATA
Value of a dataset can be
• Immediate
• Gained over time
• Transient
• Little (i.e. it’s easier to recreate than curate)
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
15. VALUE OF DATA
“Fundamentally, there is a shift from a
document-centric view of scholarship to a data-
centric view of scholarship”
- Sayeed Choudhury
Choudhury, Sayeed. "Data curation: An ecological perspective." College & Research Libraries News 71, no. 4 (2010): 194-196.
16. WHY OPEN?
Data that underpin a journal article should be
made concurrently available in an accessible
database.
We are now on the brink of an achievable aim:
for all science literature to be online, for all of
the data to be online and for the two to be
interoperable.
Adapted from Jonathan Tedds's Distinguished Lecture at DLab, UC Berkeley, 12 Sep 2013: "The Open Research Challenge: Peer Review and Publication of Research Data". Licensed under CC BY.
Royal Society, June 2012, Science as an Open Enterprise,
http://royalsociety.org/policy/projects/science-public-enterprise/report/
17. DATA AVAILABILITY
Vines, Timothy H., Arianne Y. K. Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, et al. "The Availability of Research Data Declines Rapidly
with Article Age." Current Biology 24, no. 1 (January 6, 2014): 94-97. https://linkinghub.elsevier.com/retrieve/pii/S0960-9822(13)01400-0
Researchers requested data sets from a relatively
homogeneous set of 516 articles published 1991-2011 in the field
of zoology
Tracking down the authors & getting a response was the first
challenge.
For every yearly increase in article age, the odds of the data
set being reported as extant decreased by 17%
When the authors did give the status of their data, the
proportion of data sets that still existed dropped from 100%
in 2011 to 33% in 1991
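To make the 17% figure concrete, here is a minimal sketch of how a constant yearly drop in odds compounds over time. It assumes the decline applies multiplicatively each year (a factor of 0.83), and the starting probability of 0.95 is purely illustrative; the function names and that starting value are assumptions, not taken from the study.

```python
def odds_multiplier(years, yearly_decrease=0.17):
    """Factor by which the odds shrink after `years` of a constant yearly decline."""
    return (1.0 - yearly_decrease) ** years


def probability_after(years, start_probability, yearly_decrease=0.17):
    """Convert a starting probability to odds, apply the decline, convert back."""
    start_odds = start_probability / (1.0 - start_probability)
    odds = start_odds * odds_multiplier(years, yearly_decrease)
    return odds / (1.0 + odds)


# start_probability=0.95 is an illustrative assumption, not a reported value.
for age in (0, 5, 10, 20):
    print(age, round(probability_after(age, 0.95), 3))
```

Note that a 17% drop in odds is not the same as a 17% drop in probability, which is why the sketch converts between the two.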
18. DATA AVAILABILITY
Vines, Timothy H., Arianne Y. K. Albert, Rose L. Andrew, Florence Débarre, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, et al. "The Availability of Research Data Declines Rapidly
with Article Age." Current Biology 24, no. 1 (January 6, 2014): 94-97. https://linkinghub.elsevier.com/retrieve/pii/S0960-9822(13)01400-0
Many of these missing data sets could be
retrieved only with considerable effort by the
authors, and others are completely lost to
science
19. DATA LOSS
Adapted from Mitcham, Jenny & Lindsey Myers. “Managing your research data”. Licensed under CC BY-NC-SA
20. DATA LOSS
Adapted from Mitcham, Jenny & Lindsey Myers. “Managing your research data”. Licensed under CC BY-NC-SA
21. DATA LOSS
• Human error
• Natural disaster
• Facilities infrastructure
failure
• Storage failure
• Server
hardware/software
failure
• Application software
failure
• Format obsolescence
• Legal encumbrance
• Malicious attack
• Loss of staffing
competencies
• Loss of institutional
commitment
• Loss of financial stability
Peters, Christie. Research Data Management: Basics and Best Practices.
http://uknowledge.uky.edu/cgi/viewcontent.cgi?article=1000&context=rdsc_workshops. Licensed under CC BY
22. DISCUSSION
• Have you seen a shift to a data-centric
research culture in your discipline?
• Is data availability a concern among you or
your colleagues?
• Other ideas & questions
• Up next: Open Data Benefits and
Challenges
25. HIGH ASPIRATIONS,
LOW UPTAKE
• Berlin Declaration on Open Access to Knowledge in
the Sciences and Humanities (2003; 572
institutions)
• Recommendations for Access to Data from
Publicly Funded Research (2006, all OECD
member states)
26. CULTURE CHANGE?
A survey of 17,000 UK doctoral students
showed that they are privately open to sharing
resources,
but in practice they follow the behaviors of their supervisors
and fear losing future publication opportunities
Researchers of Tomorrow – The Research Behaviour of Generation Y Doctoral Students. London, United Kingdom: JISC. Retrieved from: http://www.jisc.ac.uk/publications/reports/2012/researchers-of-tomorrow.
Tenopir, C, Dalton, E D, Allard, S, Frame, M, Pjesivac, I, Birch, B, Pollock, D and Dorsett, K (2015). Changes in Data Sharing and Data Reuse Practices and Perceptions among Scientists Worldwide. PLoS ONE 10(8):
e0134826. DOI: https://doi.org/10.1371/journal.pone.0134826
27. STRUCTURAL BARRIERS
Small data could initially be published as part of
the original publication as tables
As size and complexity of data grew and
publishers enforced page limits, data publication
was prohibited or impossible
Klump, J. (2017). Data as Social Capital and the Gift Culture in Research. Data Science Journal, 16, p. 14. DOI: http://doi.org/10.5334/dsj-2017-014
31. RATIONALES FOR SHARING
RESEARCH DATA
• Stakeholders
• Researchers
• Public
• Journals
• Funders
• Libraries
• Motivations to share
• Needs of research community
• Needs of the public at large
• Beneficiaries of sharing
• Those who produce the data
• Those who use the data
33. REPLICATION/REPRODUCIBILITY
“This week, economists have
been astonished to find that a
famous academic paper often
used to make the case for
austerity cuts contains major
errors. Another surprise is that
the mistakes, by two eminent
Harvard professors, were
spotted by a student doing his
homework.”
Alexander, Ruth. “Reinhart, Rogoff... and Herndon: The student who caught out the profs.” BBC News.
http://www.bbc.com/news/magazine-22223190
34. REPLICATION/REPRODUCIBILITY
• 90% of respondents to a recent survey in
Nature agreed that there is a ‘reproducibility
crisis’
• Increasing number of retractions
• Failures to replicate high profile studies
• Underlying causes
• Mechanized reporting of statistical results
• Publication bias towards statistically significant
results
Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
35. REPLICATION/REPRODUCIBILITY
• Transparency and Openness Promotion (TOP)
Guidelines
(https://osf.io/9f6gx/wiki/Guidelines/)
• Badges to articles with open data
• The Peer Reviewers' Openness Initiative
• Open Science Framework Reproducibility
Project (https://osf.io/ezcuj/wiki/home/)
• Science Exchange Reproducibility Initiative
(http://validation.scienceexchange.com/#/)
36. JOURNAL MANDATES
• Mandatory requirement to archive data publicly
unless there is a valid reason not to
• Response to low voluntary uptake
• To allow reproduction of reported results
• Ecology, evolution, biology
• These policies do work to increase data archiving
• However, the quality varies…
Roche DG, Kruuk LEB, Lanfear R, Binning SA (2015) Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLoS
Biol 13(11): e1002295. https://doi.org/10.1371/journal.pbio.1002295
37. JOURNAL MANDATES
Researchers surveyed 100 datasets associated with
nonmolecular studies in journals that commonly
publish ecological and evolutionary research and
have a strong public data archiving (PDA) policy.
Out of these datasets, 56% were incomplete, and
64% were archived in a way that partially or entirely
prevented reuse.
Roche DG, Kruuk LEB, Lanfear R, Binning SA (2015) Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLoS
Biol 13(11): e1002295. https://doi.org/10.1371/journal.pbio.1002295
38. REPLICATION/REPRODUCIBILITY
"True reproducibility requires deep engagement
with the epistemological questions of a given
research specialty, and the very different ways in
which investigators obtain and value evidence"
“As rationale for sharing research,
reproducibility…risks reducing the research
process to a set of mechanistic procedures”
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
39. REPRODUCIBILITY AS
RATIONALE
• Where data deposit is required as condition of
publication (e.g. Protein Data Bank),
researchers will comply
• Data sharing more likely if
• Materials/documentation are automated
• Data is not sensitive/no licensing restrictions apply
• Publication is completed
• Data is not part of a long-term study integral to
researcher’s career
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
40. DISCUSSION
Is there a reproducibility crisis?
If so, to what extent can data sharing
remedy the crisis?
Other questions/comments
Up next: data management plans &
sharing for public use
42. PUBLIC USE
Tax monies should be leveraged to serve the public
good
Data should not be hoarded by researchers
Public understanding of research
Evidence-based advocacy
Education & teaching
Citizen science
Policymakers
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
43. OPEN GOVERNMENT DATA
White House Office of Science & Technology Policy
memo: “Expanding Public Access to the Results of
Federally Funded Research” (Feb 2013)
digitally formatted scientific data resulting
from unclassified research supported wholly
or in part by Federal funding should be stored
and publicly accessible to search, retrieve,
and analyze.
44. OPEN GOVERNMENT DATA
• Data is hard (or even
impossible) to find
• Data cannot be
readily used
• Unavailable, unclear,
restrictive licensing
terms
https://blog.okfn.org/files/2017/06/FinalreportTheStateofOpenGovernmentDatain2017.pdf
Global Open Data Index
(GODI):
https://index.okfn.org/
“Measures the openness of
government data according to
the Open Definition”
45. DATA MANAGEMENT POLICIES
• NSF
• NIH
• NEH
• NASA
• NOAA
• CDC
• Gates Foundation
http://dms.data.jhu.edu/data-management-resources/plan-research/funders-data-sharing-requirement/funder-data-related-mandates-and-public-access-plans/
46. DATA MANAGEMENT PLANS
• Roles and responsibilities
• Description of data and metadata
• Storage, backup and security
• Provisions for privacy, confidentiality,
intellectual property rights and other rights
• Data access and sharing
• Data reuse, redistribution and production of
derivatives
• Archiving and preservation
University of Iowa Libraries. Data Management Plans. Licensed under CC BY.
http://guides.lib.uiowa.edu/c.php?g=132111&p=900990
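One way to use the checklist above is to treat it as a simple completeness check on a draft plan. The sketch below is only an illustration: the section names follow the slide, while `missing_sections` and the contents of `draft_plan` are hypothetical and not part of any funder's tooling.

```python
# Section names follow the checklist above; the draft plan below is hypothetical.
DMP_SECTIONS = [
    "Roles and responsibilities",
    "Description of data and metadata",
    "Storage, backup and security",
    "Provisions for privacy, confidentiality, intellectual property rights and other rights",
    "Data access and sharing",
    "Data reuse, redistribution and production of derivatives",
    "Archiving and preservation",
]


def missing_sections(plan):
    """Return the checklist sections that are absent or empty in `plan`."""
    return [name for name in DMP_SECTIONS if not plan.get(name, "").strip()]


draft_plan = {
    "Roles and responsibilities": "PI oversees data; lab manager handles backups.",
    "Description of data and metadata": "Survey responses in CSV with a codebook.",
    "Data access and sharing": "",  # present but still empty
}

for section in missing_sections(draft_plan):
    print("Missing:", section)
```

A check like this only confirms that each section has some text; whether the content is adequate is still a matter for peer review and program management.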
47. NSF DATA SHARING POLICY
What constitutes reasonable data management
and access
…and reasonable length of time
will be determined by the community of interest
through the process of peer review and program
management
NSF Data Management & Sharing FAQ. https://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp
48. NSF DATA SHARING POLICY
Annual reports must include information on the
progress on data management and sharing of
research products
Final project reports are to contain a more
thorough updating of the original DMP, including
how your data is archived.
http://dms.data.jhu.edu/data-management-resources/plan-research/funders-data-sharing-requirement/funder-data-related-mandates-and-public-access-plans/
49. PUBLIC HEALTH
WHO seeks a paradigm shift in the approach to
information sharing in emergencies, from one limited by
embargoes set for publication timelines, to open sharing
using modern fit-for-purpose pre-publication platforms.
Opting in to data and results sharing should be the
default practice and the onus should be placed on data
generators and stewards at the local, national and
international level to explain any decision to opt out from
sharing data and results during public health emergencies
World Health Organization “Developing global norms for sharing data and results during public health emergencies”
50. PUBLIC HEALTH
Many publishers, NGOs and research funders
committed to free research sharing in light of the
Zika outbreak
Wellcome Trust. “Statement on data sharing in public health emergencies.”
Journal signatories will make all content concerning the Zika virus free to access. Any data or
preprint deposited for unrestricted dissemination ahead of submission of any paper will not pre-empt its
publication in these journals.
Funder signatories will require researchers undertaking work relevant to public health emergencies to
set in place mechanisms to share quality-assured interim and final data as rapidly and widely as
possible, including with public health and research communities and the World Health Organization.
Wiley, Taylor and Francis, and Elsevier are not signatories.
51. SHERPA JULIET
Searchable database and single focal point of
up-to-date information concerning funders'
policies and their requirements on open access,
publication and data archiving.
http://v2.sherpa.ac.uk/juliet/
54. ASKING NEW QUESTIONS OF
EXTANT DATA
• Encourages meta-analyses & data combination
• Exploring new questions and identifying new
relationships
55. HUBBLE SPACE TELESCOPE
DATA REUSE
• General Observing (GO) paper: At least one author was
an investigator on the GO proposal that obtained the data.
• Archival Research (AR) paper: No overlap between the paper authors and investigators
on the GO proposal that obtained the data.
• GO+AR: Combination of GO data sets with AR data sets.
Adapted from Jonathan Tedds's Distinguished Lecture at DLab, UC Berkeley, 12 Sep 2013: "The Open Research Challenge: Peer Review and Publication of Research Data". Licensed under CC BY.
Royal Society, June 2012, Science as an Open Enterprise,
http://royalsociety.org/policy/projects/science-public-enterprise/report/
Papers based upon
reuse of archived
observations now
exceed those based
on the use described
in the original
proposal.
https://archive.stsci.edu/hst/bibliography/pubstat.html
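The GO / AR / GO+AR categories above amount to a small classification rule based on author overlap. The sketch below is one possible reading of that rule, not the archive's actual procedure: it assumes each paper lists the proposals whose data it uses, and that a data set counts as "GO" for a paper when at least one paper author was an investigator on that proposal. The function name and the example names are hypothetical.

```python
def classify_paper(paper_authors, proposals_used):
    """Classify a paper as 'GO', 'AR', or 'GO+AR'.

    `proposals_used` holds one collection of investigator names per
    GO proposal whose data the paper uses.
    """
    authors = set(paper_authors)
    # True where at least one paper author was an investigator on the proposal.
    overlaps = [bool(authors & set(investigators)) for investigators in proposals_used]
    if not overlaps:
        raise ValueError("paper must use data from at least one proposal")
    if all(overlaps):
        return "GO"
    if not any(overlaps):
        return "AR"
    return "GO+AR"


# Hypothetical author and investigator names, for illustration only.
print(classify_paper(["Ames", "Baker"], [["Ames", "Cole"]]))
print(classify_paper(["Ames", "Baker"], [["Cole", "Diaz"]]))
print(classify_paper(["Ames", "Baker"], [["Ames"], ["Diaz"]]))
```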
56. ASKING NEW QUESTIONS OF
EXTANT DATA
• Assessing veracity requires domain expertise &
misinterpretation is a serious risk
• Depends on extensive documentation &
description
• The farther the user is from the point of data
origin
• The more documentation required
• The more effort required by reuser
• Greater the risk of misinterpretation
• Benefits prospective users more than
producers
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
57. ADVANCING RESEARCH AND
INNOVATION
• Data-intensive fields (astronomy, social
sciences, economics)
• Comparisons across time and space (ecology,
biology, sociology)
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
58. ADVANCING RESEARCH AND
INNOVATION
• Maximizing the use of data
• Increasing the impact of findings
• Progressing the state of research
• Laying broader foundation for knowledge
• Diversifying perspectives
Fischer, B.A., & Zigmond, M.J. (2010). The essential nature of sharing in science. Science and Engineering Ethics, 16(4), 783–799.
59. DATA SHARING ASSOCIATED
WITH CITATION IMPACT
Examined the citation
history of 85 cancer
microarray clinical trial
publications with respect to
the availability of their
data.
The 48% of trials with
publicly available
microarray data received
85% of the aggregate
citations
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308.
https://doi.org/10.1371/journal.pone.0000308
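The two percentages above can be turned into a rough per-trial comparison. This is a minimal arithmetic sketch using only the 48% and 85% figures from the slide; the function name is hypothetical, and the result is a raw descriptive ratio that does not adjust for any other differences between the trials.

```python
def per_trial_citation_ratio(share_of_trials, share_of_citations):
    """Average citations per sharing trial divided by that of non-sharing trials."""
    sharing_rate = share_of_citations / share_of_trials
    non_sharing_rate = (1.0 - share_of_citations) / (1.0 - share_of_trials)
    return sharing_rate / non_sharing_rate


# 48% of trials shared data and received 85% of the aggregate citations.
ratio = per_trial_citation_ratio(0.48, 0.85)
print(round(ratio, 1))
```

On these raw shares, the average trial with public data drew roughly six times the citations of the average trial without it.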
60. DATA SHARING ASSOCIATED
WITH CITATION IMPACT
Does not imply causation
• But there may be mechanisms in which data
sharing did stimulate greater citations
• Exposure
• Reanalysis
• Enthusiasm and synergy around a specific research
question
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308.
https://doi.org/10.1371/journal.pone.0000308
62. RESEARCHER CONCERNS
• Data is competitive advantage
• Data is intellectual capital
• Time & effort required to prepare data for
archiving
• Lack of recognition & other extrinsic incentives
• Concerns about data misinterpretation
Roche, D. G., Kruuk, L. E. B., Lanfear, R., & Binning, S. A. (2015). Public data archiving in ecology and evolution: How well are we doing?
PLoS Biology, 13(11) doi:http://dx.doi.org/10.1371/journal.pbio.1002295
63. OPEN QUESTIONS
• What data to share?
• What is sharing?
• What is interpretable and reusable?
• How to reward/give credit?
• How to document without extensive labor?
• How to handle misuse/misinterpretation?
• Restricting access/de-identification
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
64. OPEN QUESTIONS
• Lack of demonstrated demand for research data
outside genomics, climate science, astronomy,
social science, demographics
• How open is it?
• Who owns the copyright? Is data public
domain?
• How to validate data?
• Preserving data
Borgman, C. L. (2012), The conundrum of sharing research data. J Am Soc Inf Sci Tec, 63: 1059–1078. doi:10.1002/asi.22634
Data Sharing in the Sciences
69. ACCESSING & USING
OPEN DATA
• Open source software: R
• rOpenSci (ropensci.org)
• rOpenGov (https://ropengov.github.io/projects/)
• Run My Code (http://www.runmycode.org)
• Google Public Data Explorer
(https://www.google.com/publicdata)
www.r-project.org