The document discusses data sharing policies and mandates from various organizations including federal funding agencies in the US and internationally, journals, and a paradigm shift toward more transparent and collaborative research that integrates publications and data. Key points include requirements for data management plans from NIH and NSF, expectations of funding agencies in other countries to maximize access to research data, a journal policy requiring data to be made available, and challenges around measuring the impact of shared data given the lack of common practices and standards for citing data.
2. US Federal Funding Mandates
NIH (2003) Data Sharing Policy: all funding
applications of $500,000 or more per year are
expected to address data sharing in their
application.
NSF (2011) All funding proposals submitted on
or after January 18, 2011, must include a “Data
Management Plan” describing how the
proposal will conform to NSF policy on the
dissemination and sharing of research results.
3. International Mandates
Aug 2011… “expectation that all our funded
researchers should maximise access to
their research data with as few
restrictions as possible. …. submit a data
management and sharing plan as part
of the application process.”
2007… “Researchers are to retain research
data and primary materials, manage
storage of research data and primary
materials, maintain confidentiality of
research data and primary materials.”
4. Journal Mandates
Dec 2013 . . .“We ask you to make available the data
underlying the findings in the paper, which would be
needed by someone wishing to understand, validate or
replicate the work. Our policy has not changed in this regard.
What has changed is that we now ask you to say where the
data can be found.
As the PLOS data policy applies to all fields in which we
publish, we recognize that we’ll need to work closely with
authors in some subject areas to ensure adherence to the new
policy. Some fields have very well established standards and
practices around data, while others are still evolving, and we
would like to work with any field that is developing data
standards. We are aiming to ensure transparency about
data availability.”
5. Questions
Sharing data—how does it happen?
What is data publishing?
Is data archiving the same?
How can we find data, access it, and reuse it?
How can we measure the impact of sharing data?
What’s the common denominator?
7. Paradigm Shift
The nature of research has become…
More quantitative/data-intensive
More funder-driven
More interdisciplinary/collaborative
More transparent
More complicated in terms of cross-linking
More diverse in terms of citable scholarly
outputs
8. Paradigm Shift
The focus of scholarly communication has changed…
From:
Preserve publications
Preserve data
Preserve both (at least separately)
To:
Preserve publications and data ‘together’
Preserve the ‘relationships’ among them
11. Data Dissemination Methods Indicated in
DMPs Written by UM Engineering Faculty
journal publication: 42%
faculty/project website: 36%
conference presentation: 11%
"upon request": 11%
Source: NSF Engineering Data Management Plan Analysis, N=156
12. Data Dissemination Methods
Submitted with journal article
Appear in journal article upon publication
Supplemental materials (including codebooks)
Websites (prior/post publication)
Institutional repositories (prior/post publication)
Data archive per discipline’s culture of sharing
Data repository (may be assigned by journal
publishers)
Data papers in data journals (may be independent of
the journal article)
“Data upon request” via email (some/all)
13. Repository Directory Lists
IR
OpenDOAR (over 2600 academic open access repositories
listed)
Deep Blue (University of Michigan Library)
DR
NIH Data Sharing Repositories (57 repositories)
Thomson Reuters Data Citation Index (174 repositories)
Databib (975 repositories listed)
re3Data.org (609 repositories listed)
DataCite, re3data.org, and Databib announced collaboration
towards one service under the auspices of DataCite by 2015
14. Disciplinary Data Repositories:
What to Look for?
Subject/Discipline focus
Hosted by…
Access to data: open vs. restricted
Deposit of data: open vs. restricted
Deposit fee
Persistent identifiers (DOI, hdl)
Sustainability & preservation policy
(Non-) Proprietary file formats
Amount of data description/metadata
(data package level, file level, data item level)
Associated code/software
15. More on Persistent IDs
A DOI is a system for persistently identifying and locating digital objects;
Originally designed and developed for “journal articles”; ISO 26324 since 2012
DOIs can be assigned only by DOI registration agencies: e.g. DataCite, CrossRef
Assigning a DOI is not free (e.g. about $1 per DOI via CrossRef in 2013)
DOI: prefix + suffix
• e.g. DOI for a dataset http://doi.org/10.3886/ICPSR27282.v1
DOI prefix is unique to each publisher/repository
• ICPSR: 10.3886
• UK Data Service: 10.5255
• Figshare: 10.6084
• PANGAEA: 10.1594
• Dryad: 10.5061
Very similar to ‘handles’ in terms of persistence
• e.g. U of M IR Deep Blue: http://hdl.handle.net/2027.42/106575
Moving towards “Data with DOI”, just as for scholarly articles
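The "prefix + suffix" structure above can be sketched in a few lines of Python. This is only an illustration: the function names `split_doi` and `repository_for` are hypothetical, the lookup table holds just the five prefixes listed on the slide, and the validity check is a rough test rather than a full implementation of the DOI syntax.

```python
from typing import Tuple

# Prefix-to-repository pairs taken from the slide above (not a complete registry).
KNOWN_PREFIXES = {
    "10.3886": "ICPSR",
    "10.5255": "UK Data Service",
    "10.6084": "Figshare",
    "10.1594": "PANGAEA",
    "10.5061": "Dryad",
}

# Common ways a DOI is written in citations and links.
_LEADERS = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def split_doi(doi: str) -> Tuple[str, str]:
    """Return (prefix, suffix) for a bare DOI, a 'doi:' string, or a doi.org URL."""
    cleaned = doi.strip()
    for leader in _LEADERS:
        if cleaned.lower().startswith(leader):
            cleaned = cleaned[len(leader):]
            break
    # The prefix is everything before the first slash; the suffix is the rest.
    prefix, slash, suffix = cleaned.partition("/")
    if not slash or not suffix or not prefix.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def repository_for(doi: str) -> str:
    """Look up which repository owns the DOI prefix, if it is in the table."""
    prefix, _ = split_doi(doi)
    return KNOWN_PREFIXES.get(prefix, "unknown")


if __name__ == "__main__":
    print(split_doi("http://doi.org/10.3886/ICPSR27282.v1"))
    print(repository_for("doi:10.5061/dryad.20"))
```

Because the prefix is unique to each publisher or repository, a lookup like this is enough to attribute a dataset DOI to its host without resolving it. Handles such as the Deep Blue example are rejected by `split_doi`, since they do not start with "10.".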
18. Data Journals
Number of ‘Data Journals’
As of today, 70+ data journals*
Journal host
a) Authors
b) Journals
c) Publisher data repositories
d) Data repositories (IR/DR)
Data journal article structure
a) Intro/Overview
b) Methods
c) Dataset description
d) Reuse potential
Source: K. Akers and J. Green. Data Sharing and Publication,
Presented at the Cyberinfrastructure (CI) Days Event, University
of Michigan, Ann Arbor, MI, November 13-14, 2013.
*Note: To see a full list of data journals that currently exist, see
K. Akers’ blog post at:
http://mlibrarydata.wordpress.com/2014/05/09/data-journals/
19. Data Journal Example
Geoscience Data Journal by Wiley
Launched in Fall 2012
Published on behalf of Royal Meteorological Society
OA with author-pay model ($1,500 per article)
Publishes short data papers cross-linked to (and citing)
datasets that have been deposited in approved data
centers/repositories and awarded DOIs.
A data article describes a dataset, giving details of its
collection, processing, file formats etc., but does not go
into detail of any scientific analysis of the dataset or
draw conclusions from that data.
The data paper should allow the reader to understand
when, why and how the data was collected, and what
the data is.
20. Data Journal Example (continued)
Data centers/repositories approved by Geoscience Data Journal
3TU.Datacentrum
British Atmospheric Data Centre (BADC)
British Oceanographic Data Centre (BODC)
CISL Research Data Archive
CSIRO Data Access Portal
Environmental Information Data Centre (EIDC)
Figshare
IEDA:EarthChem
IEDA:MGDS
National Center for Atmospheric Research (NCAR), USA
Earth Observing Lab (EOL), observational and supporting data from atmospheric science field
experiments and arctic research
Research Data Archive (RDA), reference datasets for weather and climate research
National Geoscience Data Centre (NGDC)
NERC Earth Observation Data Centre (NEODC)
NOAA National Climatic Data Center (NCDC)
NOAA National Oceanographic Data Center (NODC)
NOAA National Geophysical Data Center (NGDC)
PANGAEA
Polar Data Centre (PDC)
Zenodo
22. Data Publisher Examples
Wiley
Geoscience Data Journal
Ubiquity Press
Journal of Open Archaeology Data
Journal of Open Psychology Data
Open Health Data
Journal of Open Research Software
Nature
Scientific Data
23. Data Journal Examples (to name only a
few): Some Feature Comparison
Publisher | Journal | OA? | Publication fee per article | Publisher hosts data? | Approved data centers/repositories recommended for data deposit? | What is the article called? | DOI?
Wiley | Geoscience Data Journal | Yes | $1,500 | No | Yes | ‘Data Paper’ | Yes
Ubiquity Press | Open Archaeology Data | Yes | $40 | No | Yes | ‘Data Paper’ | Yes
Nature Publishing Group | Scientific Data | Yes | $700 | No | Yes | ‘Data Descriptor’ | Yes
25. Located on U of M Campus
www.icpsr.umich.edu ICPSR: Inter-university Consortium for Political and Social Research
27. Signs of a Trusted Repository
A unit of ISR, ICPSR is governed by a Council representing
over 700 member institutions, including U of M
Long-term sustainability: “publishing” data for 52 years
Largest social science data repository in US with a catalog
of over 8,000 studies containing thousands of files
Awarded the Data Seal of Approval from DANS
Federal agencies’ archives are housed at ICPSR and fully
integrated with ICPSR’s collection
Data preservation standards are followed to keep data usable long-term,
guarding against deterioration, accidental loss, and digital
obsolescence
Data are screened for confidentiality and privacy concerns.
Stringent protections are in place for securing and
distributing sensitive data.
Physical and virtual data enclaves for analyzing restricted-
use data
28. Rich Metadata for Better Access,
Discovery, Context, and Reuse
ICPSR formats, organizes and enhances deposited raw
research data with meaningful metadata and
documentation to make it complete, self-explanatory, and
usable for future researchers
Study metadata and codebooks are generated according to
the Data Documentation Initiative (DDI) XML standard
Search and filter online catalog with fielded metadata
records to enhance discovery; side-by-side comparison using
structured variable-level documentation in XML, tagged
according to the DDI standard
All studies are registered with a unique identifier—DOIs
from DataCite. ICPSR has been providing citations to its
data since 1990 and started assigning DOIs in 2008
34. Top 10 Data Downloads (last six months)
(non-anonymous, distinct users downloading one or more files)
Title | Archive | # Downloads
National Longitudinal Study of Adolescent Health (Add Health), 1994-2008 | DSDR | 1,188
General Social Survey, 1972-2012 [Cumulative File] | ICPSR | 737
Chinese Household Income Project, 2002 | DSDR | 720
India Human Development Survey (IHDS), 2005 | SAMHDA | 445
Collaborative Psychiatric Epidemiology Surveys (CPES), 2001-2003 [United States] | CPES | 407
National Survey on Drug Use and Health, 2012 | SAMHDA | 314
Children of Immigrants Longitudinal Study (CILS), 1991-2006 | DSDR | 289
National Crime Victimization Survey, 2012 | NACJD | 260
National Prisoner Statistics, 1978-2011 | NACJD | 249
Historical, Demographic, Economic, and Social Data: The United States, 1790-2002 | ICPSR | 245
35. Who uses these shared data?
How are they used?
With what impact?
38. The ICPSR Bibliography of
Data-related Literature
Link research data to the scholarly literature about it
Aid students, instructors, researchers, and funders to
discover and understand data use
A searchable database currently containing over 65,000
citations of known published and unpublished works
resulting from analyses of data archived at ICPSR
It generates study bibliographies linking each study with
the literature about it, and out to the full text
46. Altmetrics for research data
Easier to access and analyze much more
research data online
New focus on sharing that research data
Increasing use of social media to discuss, via
tweets, likes and blog posts
More online tools to download, collaborate
and share, like Mendeley, Figshare,
SlideShare, Dryad and ResearchGate,
DeepBlue, openICPSR
Dependent on good citation practice
47. Publishers
• Springer
• Elsevier
• Wiley
• Cambridge Journals
• BMJ Journals
• Nature Publishing Group
• PLoS
Altmetrics Aggregators
• Altmetric
• ImpactStory
• Plum Analytics
Funders
• NSF
• Sloan Foundation**
• MacMillan
• EBSCO
**The Alfred P. Sloan Foundation helps fund
ImpactStory, and is now funding the National
Information Standards Organization (NISO) to
develop standards and recommended best
practices for altmetrics.
48. Impact Story:
Product-level Metric
“New ways to measure the research impact . . . of
emerging products like blog posts, datasets, and
software . . . to build a new scholarly reward system
that values and encourages web-native scholarship.”
Open metrics, with context, using diverse products
to provide researchers with a “comprehensive impact
report” of their research output
Source: https://impactstory.org/about
51. Integration with Web of Science All
Databases: Research data is equal
to research literature
52. Articles linked to underlying data.
Increased data discovery.
Reward for data citation.
Potential for automated tracking.
53. Elsevier Connect
“Elsevier is collaborating with a rapidly growing number of
external data set repositories to optimize interoperability
between their data sets and research articles on
ScienceDirect. As part of the Article of the Future project,
this reciprocal linking aims to expand the availability of
research data and improve the researcher workflow.”
“Elsevier encourages authors to submit their data sets to
external repositories. . . But not all authors know how or
where to submit their data, and not all authors are aware of
the possibilities that data linking offers. . .The recent
agreement with Dryad Digital Repository marked the 35th
data linking partnership Elsevier has established. . .”
Source: http://www.elsevier.com/connect/bringing-data-to-life-with-data-linking
55. For Better Metrics on Research
Data Impact
Need more aggregator and repository data to be
exposed for altmetric harvesters like ImpactStory
More integrated efforts among libraries, publishers,
archives, and funders. For example:
The Data Conservancy, IEEE, and Portico receive
Alfred P. Sloan Foundation grant to connect
publications and their linked data
59. No Common Practice of Formal
Data Citation
Abstract?
Acknowledgements?
Charts and Tables?
Appendices?
Discussion?
Footnotes?
Sample?
Methods?
References!
Without an explicit
citation, reader must
infer or be out of luck
No attribution—no credit
No access—no reuse
No discernible impact!
60. Examples of Bad Data Citation
Poorly described and cited data
+
Excessive human search effort, extensive collection
knowledge
=
Too costly, too questionable for confident measure
of impact
62. Examples of Good Data Citation
Formal data citation with a DOI
+
Minimal human search effort
=
High hit accuracy for the cost, and better
confidence of impact measures
64. Basic Data Citation Format
Creator (Year) Title. Publisher. Identifier
(For datasets that have DOIs, DataCite and CrossRef provide a citation
formatter to generate a citation in various journal styles.)
Core Elements
Creator(s): Individual(s) or organization responsible for creating
the dataset.
Year: Year the dataset was published, not necessarily created.
Title: Should be as descriptive as possible
Publisher: Organization that provides access to the dataset (e.g.
Dryad, Zenodo)
Identifier: Persistent, unique identifier (e.g. a DOI)
Source: http://datapub.cdlib.org/datacitation/
How to Cite Data
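As a rough illustration of the basic format, the five core elements can be held in a small record and joined in the order Creator (Year) Title. Publisher. Identifier. The class name `DataCitation` and its `format` method are hypothetical names for this sketch; journal-specific styles would need a real citation formatter such as the ones DataCite and CrossRef provide.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """The five core elements of a basic data citation."""
    creator: str      # individual(s) or organization responsible for the dataset
    year: int         # year the dataset was published, not necessarily created
    title: str        # as descriptive as possible
    publisher: str    # organization that provides access to the dataset
    identifier: str   # persistent, unique identifier, e.g. a DOI

    def format(self) -> str:
        """Render as: Creator (Year) Title. Publisher. Identifier"""
        # Drop a trailing period so the title is not followed by "..".
        title = self.title.rstrip(".")
        return (
            f"{self.creator} ({self.year}) {title}. "
            f"{self.publisher}. {self.identifier}"
        )


if __name__ == "__main__":
    citation = DataCitation(
        creator="Sidlauskas B",
        year=2007,
        title=(
            "Data from: Testing for unequal rates of morphological "
            "diversification in the absence of a detailed phylogeny: "
            "a case study from characiform fishes"
        ),
        publisher="Dryad Digital Repository",
        identifier="doi:10.5061/dryad.20",
    )
    print(citation.format())
```

The sample values are the Dryad dataset cited elsewhere in this deck, so the printed line should match that citation.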
65. Additional Elements
Location / Availability: The web address of the dataset is essential
when the identifier can’t be used to reach the dataset.
Version / Edition: Version of the dataset used in the present
publication. Needed to reproduce analysis of versioned dynamic
datasets.
Access Date: Date of access for analysis in the present publication.
Needed to reproduce analysis of continuously updated dynamic
datasets.
Format / Material Designator: e.g., database, CD-ROM.
Feature Name: A description of the subset of the dataset used. May be
a formal title or a list of variables (e.g., concentration, optical density).
Verifier: Used to confirm that two datasets are identical. Most
commonly a UNF or MD5 checksum.
Series: Used if the dataset is part of series of releases (e.g., monthly)
Contributor: e.g., editor, compiler
Source: http://datapub.cdlib.org/datacitation/
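The Verifier element can be illustrated with an MD5 checksum computed by Python's standard `hashlib` module. This is a minimal sketch: the function names `md5_of_file` and `same_dataset` are hypothetical, and it covers only the MD5 case. A UNF is calculated differently and is not shown here.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 checksum of a file as a hex string, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_dataset(path_a: str, path_b: str) -> bool:
    """Two files count as the same dataset here when their checksums match."""
    return md5_of_file(path_a) == md5_of_file(path_b)


if __name__ == "__main__":
    # Write two small stand-in "datasets" with identical content and compare them.
    with tempfile.TemporaryDirectory() as folder:
        first = os.path.join(folder, "copy_a.csv")
        second = os.path.join(folder, "copy_b.csv")
        for path in (first, second):
            with open(path, "wb") as handle:
                handle.write(b"id,value\n1,0.5\n")
        print(md5_of_file(first))
        print(same_dataset(first, second))
```

A checksum of this kind compares files byte for byte, so the same data saved in a different file format would produce a different value.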
66. Data Citation Examples
Deschenes, Elizabeth Piper, Susan Turner, and Joan Petersilia.
Intensive Community Supervision in Minnesota, 1990-1992: A Dual
Experiment in Prison Diversion and Enhanced Supervised Release.
ICPSR06849-v1. Ann Arbor, MI: Inter-university Consortium for
Political and Social Research [distributor], 2000.
doi:10.3886/ICPSR06849.v1
Esther Duflo; Rohini Pande, 2006, "Dams, Poverty, Public Goods
and Malaria Incidence in India",
http://hdl.handle.net/1902.1/IOJHHXOOLZ
UNF:5:obNHHq1gtV400a4T+Xrp9g== Murray Research Archive
[Distributor] V2 [Version]
Sidlauskas B (2007) Data from: Testing for unequal rates of
morphological diversification in the absence of a detailed
phylogeny: a case study from characiform fishes. Dryad Digital
Repository. doi:10.5061/dryad.20
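Citations like these are what make automated tracking possible: a persistent identifier can be pulled out of a reference list with a simple pattern match. The sketch below is a heuristic only. The regular expressions are assumptions chosen to fit the examples above, not an official DOI or handle grammar, and the function names `find_dois` and `find_handles` are hypothetical.

```python
import re
from typing import List

# Heuristic patterns: a DOI starts with "10.", a registrant code, and a slash;
# a handle is taken here from its hdl.handle.net resolver URL.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
HANDLE_PATTERN = re.compile(r'hdl\.handle\.net/([^\s"<>]+)')

# Punctuation that often follows an identifier at the end of a sentence.
_TRAILING = ".,;)"


def find_dois(text: str) -> List[str]:
    """Return every DOI-like string in the text, without trailing punctuation."""
    return [match.rstrip(_TRAILING) for match in DOI_PATTERN.findall(text)]


def find_handles(text: str) -> List[str]:
    """Return every handle found in an hdl.handle.net URL in the text."""
    return [match.rstrip(_TRAILING) for match in HANDLE_PATTERN.findall(text)]


if __name__ == "__main__":
    reference = (
        "Sidlauskas B (2007) Data from: Testing for unequal rates of "
        "morphological diversification in the absence of a detailed "
        "phylogeny: a case study from characiform fishes. Dryad Digital "
        "Repository. doi:10.5061/dryad.20"
    )
    print(find_dois(reference))
```

A reference that mentions a dataset only by name gives a pattern like this nothing to match, which is why informal citations require manual searching.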
67. Joint Declaration of Data
Citation Principles
1. Future Of Research Communication and
E-Scholarship (FORCE11)
2. Committee on Data for Science and
Technology (CODATA)
3. Digital Curation Centre (DCC)
Source: https://www.force11.org/datacitation
68. Eight Principles
1. Importance--Data should be considered
legitimate, citable products of research. Data
citations should be accorded the same importance
in the scholarly record as citations of other
research objects, such as publications.
2. Credit and Attribution--Data citations should
facilitate giving scholarly credit and normative and
legal attribution to all contributors to the data,
recognizing that a single style or mechanism of
attribution may not be applicable to all data.
69. Eight Principles
3. Evidence—In scholarly literature, whenever and
wherever a claim relies upon data, the
corresponding data should be cited.
4. Unique Identification—A data citation should
include a persistent method for identification
that is machine actionable, globally unique, and
widely used by a community.
70. Eight Principles
5. Access—Data citations should facilitate access to
the data themselves and to such associated
metadata, documentation, code, and other
materials, as are necessary for both humans and
machines to make informed use of the referenced
data.
6. Persistence--Unique identifiers, and metadata
describing the data, and its disposition, should
persist -- even beyond the lifespan of the data they
describe.
71. Eight Principles
7. Specificity and Verifiability—Data citations
should facilitate identification of, access to, and
verification of the specific data that support a
claim.
Citations or citation metadata should include
information about provenance and fixity
sufficient to facilitate verifying that the specific
timeslice, version and/or granular portion of data
retrieved subsequently is the same as was
originally cited.
72. Eight Principles
8. Interoperability and flexibility—Data citation
methods should be sufficiently flexible to
accommodate the variant practices among
communities, but should not differ so much that
they compromise interoperability of data citation
practices across communities.
73. Make Your Data Count
If it’s not cited, it can’t be counted
Without counting data use, there is no
accurate way to measure the impact of your
shared data
Without a well-formed citation, your data
cannot take advantage of the potential of
linked scholarly publishing
Store your data where citations are unique and
persistent
Cite your own data and others’ in your
publications
74. Questions Answered?
Sharing data—how does it happen?
What is data publishing?
Is data archiving the same?
How can we find data, access it, and reuse it?
How can we measure the impact of sharing data?
What’s the common denominator?