Enriching Scholarship 2014 Beyond the Journal Article: Publishing and Citing the Underlying Data

Enriching Scholarship
May 6, 2014
Natsuko Nicholls, UM Libraries
Elizabeth Moss, ICPSR

US Federal Funding Mandates
NIH (2003): The Data Sharing Policy states that all funding
applications of $500,000 or more per year are
expected to address data sharing in their
application.
NSF (2011): All funding proposals submitted on
or after January 18, 2011, must include a “Data
Management Plan” describing how the
proposal will conform to NSF policy on the
dissemination and sharing of research results.

International Mandates
Aug 2011… “expectation that all our funded
researchers should maximise access to
their research data with as few
restrictions as possible. …. submit a data
management and sharing plan as part
of the application process.”
2007… “Researchers are to retain research
data and primary materials, manage
storage of research data and primary
materials, maintain confidentiality of
research data and primary materials.”

Journal Mandates
Dec 2013 . . .“We ask you to make available the data
underlying the findings in the paper, which would be
needed by someone wishing to understand, validate or
replicate the work. Our policy has not changed in this regard.
What has changed is that we now ask you to say where the
data can be found.
As the PLOS data policy applies to all fields in which we
publish, we recognize that we’ll need to work closely with
authors in some subject areas to ensure adherence to the new
policy. Some fields have very well established standards and
practices around data, while others are still evolving, and we
would like to work with any field that is developing data
standards. We are aiming to ensure transparency about
data availability.”

Questions
 Sharing data—how does it happen?
 What is data publishing?
 Is data archiving the same?
 How can we find data, access it, and reuse it?
 How can we measure the impact of sharing data?
 What’s the common denominator?

Paradigm Shift
The nature of research has become…
 More quantitative/data-intensive
 More funder-driven
 More interdisciplinary/collaborative
 More transparent
 More complicated in terms of cross-linking
 More diverse in terms of citable scholarly
outputs

Paradigm Shift
The focus of scholarly communication
has changed…
From:
 Preserve publications
 Preserve data
 Preserve both (at least separately)
To:
 Preserve publications and data ‘together’
 Preserve the ‘relationships’ among them

Publishing and Archiving
[Diagram labels: Scholarly Communication; Scholarly Publishing; Data Archiving; Availability, Citability, Validation; Scholarly Publishing that includes ‘Data Publication’]
Data Dissemination Methods Indicated in
DMPs Written by UM Engineering Faculty
 Journal publication: 42%
 Faculty/project website: 36%
 Conference presentation: 11%
 "Upon request": 11%
Source: NSF Engineering Data Management Plan Analysis, N=156

Data Dissemination Methods
 Submitted with journal article
 Appear in journal article upon publication
 Supplemental materials (including codebooks)
 Websites (prior/post publication)
 Institutional repositories (prior/post publication)
 Data archive per discipline’s culture of sharing
 Data repository (may be assigned by journal
publishers)
 Data papers in data journals (may be independent of
the journal article)
 “Data upon request” via email (some/all)

Repository Directory Lists
 Institutional repositories (IR)
 OpenDOAR (over 2600 academic open access repositories
listed)
 Deep Blue (University of Michigan Library)
 Data repositories (DR)
 NIH Data Sharing Repositories (57 repositories)
 Thomson Reuters Data Citation Index (174 repositories)
 Databib (975 repositories listed)
 re3Data.org (609 repositories listed)
DataCite, re3data.org, and Databib announced collaboration
towards one service under the auspices of DataCite by 2015

Disciplinary Data Repositories:
What to Look for?
 Subject/Discipline focus
 Hosted by…
 Access to data: open vs. restricted
 Deposit of data: open vs. restricted
 Deposit fee
 Persistent identifiers (DOI, hdl)
 Sustainability & preservation policy
 (Non-) Proprietary file formats
 Amount of data description/metadata
(data package level, file level, data item level)
 Associated code/software

More on Persistent IDs
 A DOI is a system for persistently identifying and locating digital objects;
originally designed and developed for “journal articles”; ISO 26324 since 2012
 DOIs can be assigned only by DOI registration agencies, e.g. DataCite, CrossRef
 Assigning a DOI is not free (e.g. ~$1 per DOI via CrossRef in 2013)
 A DOI consists of a prefix and a suffix
• e.g. DOI for a dataset: http://doi.org/10.3886/ICPSR27282.v1 (prefix 10.3886, suffix ICPSR27282.v1)
 The DOI prefix is unique to each publisher/repository
• ICPSR: 10.3886
• UK Data Service: 10.5255
• Figshare: 10.6084
• PANGAEA: 10.1594
• Dryad: 10.5061
 Very similar to ‘handles’ in terms of persistence
• e.g. U of M IR Deep Blue: http://hdl.handle.net/2027.42/106575
 Moving towards “data with DOIs”, just like scholarly articles
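
The prefix + suffix structure can be illustrated with a small sketch. It is only an illustration: the prefix table is based on the slide above, and the names `split_doi` and `PREFIX_OWNERS` are hypothetical, not part of any DOI library or registration-agency service.

```python
# Illustrative only: prefixes are based on the slide above; split_doi and
# PREFIX_OWNERS are hypothetical names, not an official DOI API.
PREFIX_OWNERS = {
    "10.3886": "ICPSR",
    "10.5255": "UK Data Service",
    "10.6084": "Figshare",
    "10.1594": "PANGAEA",
    "10.5061": "Dryad",
}

# Common ways a DOI is written in citations and links.
LEADS = ("https://doi.org/", "http://doi.org/", "http://dx.doi.org/", "doi:")


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI (bare, 'doi:'-prefixed, or a doi.org URL) into (prefix, suffix)."""
    text = doi.strip()
    for lead in LEADS:
        if text.lower().startswith(lead):
            text = text[len(lead):]
            break
    # The prefix ends at the first slash; the suffix may itself contain slashes.
    prefix, sep, suffix = text.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


prefix, suffix = split_doi("http://doi.org/10.3886/ICPSR27282.v1")
print(prefix, suffix, PREFIX_OWNERS.get(prefix, "unknown"))
```

A lookup table like this is only a convenience for a known set of repositories; the registration agencies remain the authoritative source for who holds a prefix.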

Data Repositories
Let’s take a closer look at this example!

Data Papers: Going beyond Appendices and
Supplements

Data Journals
 Number of ‘Data Journals’
As of today, 70+ data journals*
 Journal host
a) Authors
b) Journals
c) Publisher data repositories
d) Data repositories (IR/DR)
 Data journal article structure
a) Intro/Overview
b) Methods
c) Dataset description
d) Reuse potential
Source: K. Akers and J. Green. Data Sharing and Publication,
Presented at the Cyberinfrastructure (CI) Days Event, University
of Michigan, Ann Arbor, MI, November 13-14, 2013.
*Note: To see a full list of data journals that currently exist, see
K. Akers’ blog post at:
http://mlibrarydata.wordpress.com/2014/05/09/data-journals/

Data Journal Example
Geoscience Data Journal by Wiley
 Launched in Fall 2012
 Published on behalf of Royal Meteorological Society
 OA with author-pay model ($1,500 per article)
 Publishes short data papers cross-linked to (and citing)
datasets that have been deposited in approved data
centers/repositories and awarded DOIs.
 A data article describes a dataset, giving details of its
collection, processing, file formats etc., but does not go
into detail of any scientific analysis of the dataset or
draw conclusions from that data.
 The data paper should allow the reader to understand
the when, why and how the data was collected, and what
the data is.

Data Journal Example (continued)
Data centers/repositories approved by Geoscience Data Journal
 3TU.Datacentrum
 British Atmospheric Data Centre (BADC)
 British Oceanographic Data Centre (BODC)
 CISL Research Data Archive
 CSIRO Data Access Portal
 Environmental Information Data Centre (EIDC)
 Figshare
 IEDA:EarthChem
 IEDA:MGDS
 National Center for Atmospheric Research (NCAR), USA
 Earth Observing Lab (EOL), observational and supporting data from atmospheric science field
experiments and arctic research
 Research Data Archive (RDA), reference datasets for weather and climate research
 National Geoscience Data Centre (NGDC)
 NERC Earth Observation Data Centre (NEODC)
 NOAA National Climatic Data Center (NCDC)
 NOAA National Oceanographic Data Center (NODC)
 NOAA National Geophysical Data Center (NGDC)
 PANGAEA
 Polar Data Centre (PDC)
 Zenodo

Data Journal Example (continued)

Data Publisher Examples
 Wiley
 Geoscience Data Journal
 Ubiquity Press
 Journal of Open Archaeology Data
 Journal of Open Psychology Data
 Open Health Data
 Journal of Open Research Software
 Nature
 Scientific Data

Data Journal Examples (to name only a
few): Some Feature Comparison
Publisher | Journal | OA? | Publication fee per article | Publisher hosts data? | Approved data centers/repositories recommended for data deposit? | What is the article called? | DOI?
Wiley | Geoscience Data Journal | Yes | $1,500 | No | Yes | ‘Data Paper’ | Yes
Ubiquity Press | Journal of Open Archaeology Data | Yes | $40 | No | Yes | ‘Data Paper’ | Yes
Nature Publishing Group | Scientific Data | Yes | $700 | No | Yes | ‘Data Descriptor’ | Yes

Located on U of M Campus
www.icpsr.umich.edu ICPSR: Inter-university Consortium for Political and Social Research

Signs of a Trusted Repository
 A unit of ISR, ICPSR is governed by a Council representing
over 700 member institutions, including U of M
 Long-term sustainability: “publishing” data for 52 years
 Largest social science data repository in US with a catalog
of over 8,000 studies containing thousands of files
 Awarded the Data Seal of Approval from DANS
 Federal agencies’ archives are housed at ICPSR and fully
integrated with ICPSR’s collection
 Data preservation standards are followed to maintain data over the long term,
guarding against deterioration, accidental loss, and digital
obsolescence
 Data are screened for confidentiality and privacy concerns.
Stringent protections are in place for securing and
distributing sensitive data.
 Physical and virtual data enclaves for analyzing restricted-
use data

Rich Metadata for Better Access,
Discovery, Context, and Reuse
 ICPSR formats, organizes and enhances deposited raw
research data with meaningful metadata and
documentation to make it complete, self-explanatory, and
usable for future researchers
 Study metadata and codebooks are generated according to
the Data Documentation Initiative (DDI) XML standard
 Search and filter online catalog with fielded metadata
records to enhance discovery; side-by-side comparison using
structured variable-level documentation in XML, tagged
according to the DDI standard
 All studies are registered with a unique identifier—DOIs
from DataCite. ICPSR has been providing citations to its
data since 1990 and started assigning DOIs in 2008

Replication Datasets
http://www.icpsr.umich.edu/icpsrweb/deposit/pra/index.jsp

Open Sharing for DMP Proposals
http://openicpsr.org/

Top 10 Data Downloads (last six months)
(non-anonymous, distinct users downloading one or more files)
Title | Archive | # Downloads
National Longitudinal Study of Adolescent Health (Add Health), 1994-2008 | DSDR | 1,188
General Social Survey, 1972-2012 [Cumulative File] | ICPSR | 737
Chinese Household Income Project, 2002 | DSDR | 720
India Human Development Survey (IHDS), 2005 | SAMHDA | 445
Collaborative Psychiatric Epidemiology Surveys (CPES), 2001-2003 [United States] | CPES | 407
National Survey on Drug Use and Health, 2012 | SAMHDA | 314
Children of Immigrants Longitudinal Study (CILS), 1991-2006 | DSDR | 289
National Crime Victimization Survey, 2012 | NACJD | 260
National Prisoner Statistics, 1978-2011 | NACJD | 249
Historical, Demographic, Economic, and Social Data: The United States, 1790-2002 | ICPSR | 245

Who uses these shared data?
How are they used?
With what impact?

The ICPSR Bibliography of
Data-related Literature
 Link research data to the scholarly literature about it
 Aid students, instructors, researchers, and funders to
discover and understand data use
 A searchable database currently containing over 65,000
citations of known published and unpublished works
resulting from analyses of data archived at ICPSR
 It generates study bibliographies linking each study with
the literature about it, and out to the full text

Linking the Data to the Literature

Altmetrics for research data
 Easier to access and analyze much more
research data online
 New focus on sharing that research data
 Increasing use of social media to discuss, via
tweets, likes and blog posts
 More online tools to download, collaborate
and share, like Mendeley, Figshare,
SlideShare, Dryad, ResearchGate,
Deep Blue, and openICPSR
 Dependent on good citation practice

Publishers
 Springer
 Elsevier
 Wiley
 Cambridge Journals
 BMJ Journals
 Nature Publishing Group
 PLoS
Altmetrics Aggregators
 Altmetric
 ImpactStory
 Plum Analytics
Funders
 NSF
 Sloan Foundation**
 MacMillan
 EBSCO
**The Alfred P. Sloan Foundation helps fund
ImpactStory, and is now funding the National
Information Standards Organization (NISO) to
develop standards and recommended best
practices for altmetrics.

Impact Story:
Product-level Metric
 “New ways to measure the research impact . . . of
emerging products like blog posts, datasets, and
software . . . to build a new scholarly reward system
that values and encourages web-native scholarship.”
 Open metrics, with context, using diverse products
to provide researchers with a “comprehensive impact
report” of their research output
Source: https://impactstory.org/about

Artifact-level Metric
Source: http://www.plumanalytics.com/metrics.html

Integration with Web of Science All
Databases: Research data is equal
to research literature

Articles linked to underlying data.
Increased data discovery.
Reward for data citation.
Potential for automated tracking.

Elsevier Connect
 “Elsevier is collaborating with a rapidly growing number of
external data set repositories to optimize interoperability
between their data sets and research articles on
ScienceDirect. As part of the Article of the Future project,
this reciprocal linking aims to expand the availability of
research data and improve the researcher workflow.”
 “Elsevier encourages authors to submit their data sets to
external repositories. . . But not all authors know how or
where to submit their data, and not all authors are aware of
the possibilities that data linking offers. . .The recent
agreement with Dryad Digital Repository marked the 35th
data linking partnership Elsevier has established. . .”
Source: http://www.elsevier.com/connect/bringing-data-to-life-with-data-linking

Source: http://www.slideshare.net/ElsevierConnect/columbia-27feb13v2ext

For Better Metrics on Research
Data Impact
 Need more aggregator and repository data to be
exposed for altmetric harvesters like ImpactStory
 More integrated efforts among libraries, publishers,
archives, and funders. For example:
 The Data Conservancy, IEEE, and Portico receive
Alfred P. Sloan Foundation grant to connect
publications and their linked data

Formal Citation, in the References,
with the DOI
doi:10.3886/ICPSR21240

http://www.flickr.com/photos/papertrix/38028138/
Some Challenges

No Common Practice of Formal
Data Citation
 Abstract?
 Acknowledgements?
 Charts and Tables?
 Appendices?
 Discussion?
 Footnotes?
 Sample?
 Methods?
References!
 Without an explicit
citation, reader must
infer or be out of luck
 No attribution—no credit
 No access—no reuse
 No discernible impact!

Examples of Bad Data Citation
Poorly described and cited data
+
Excessive human search effort, extensive collection
knowledge
=
Too costly, too questionable for confident measure
of impact

Examples of Good Data Citation
Formal data citation with a DOI
+
Minimal human search effort
=
High hit accuracy for the cost, and better
confidence of impact measures

How to Cite Data
Basic Data Citation Format
Creator (Year) Title. Publisher. Identifier
(For datasets that have DOIs, DataCite and CrossRef provide a citation
formatter to generate a citation in various journal styles.)
Core Elements
Creator(s): Individual(s) or organization responsible for creating
the dataset.
Year: Year the dataset was published, not necessarily created.
Title: Should be as descriptive as possible
Publisher: Organization that provides access to the dataset (e.g.
Dryad, Zenodo)
Identifier: Persistent, unique identifier (e.g. a DOI)
Source: http://datapub.cdlib.org/datacitation/
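
As a minimal sketch, the basic format "Creator (Year) Title. Publisher. Identifier" can be assembled from the five core elements. The class name `DataCitation` and its `render` method are hypothetical, chosen for illustration; the sample values come from the Dryad citation shown among this deck's citation examples. This is not the DataCite or CrossRef formatter, which supports many journal styles.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """The five core elements of a data citation (hypothetical helper class)."""
    creator: str      # individual(s) or organization responsible for the dataset
    year: int         # year the dataset was published, not necessarily created
    title: str        # as descriptive as possible
    publisher: str    # organization that provides access to the dataset
    identifier: str   # persistent, unique identifier, e.g. a DOI

    def render(self) -> str:
        # Basic format: Creator (Year) Title. Publisher. Identifier
        return (f"{self.creator} ({self.year}) {self.title}. "
                f"{self.publisher}. {self.identifier}")


example = DataCitation(
    creator="Sidlauskas B",
    year=2007,
    title=("Data from: Testing for unequal rates of morphological "
           "diversification in the absence of a detailed phylogeny: "
           "a case study from characiform fishes"),
    publisher="Dryad Digital Repository",
    identifier="doi:10.5061/dryad.20",
)
print(example.render())
```

Keeping the elements as separate fields, rather than one free-text string, is what lets the same record be re-rendered in whatever style a journal requires.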

How to Cite Data
Additional Elements
Location / Availability: The web address of the dataset is essential
when the identifier can’t be used to reach the dataset.
Version / Edition: Version of the dataset used in the present
publication. Needed to reproduce analysis of versioned dynamic
datasets.
Access Date: Date of access for analysis in the present publication.
Needed to reproduce analysis of continuously updated dynamic
datasets.
Format / Material Designator: e.g., database, CD-ROM.
Feature Name: A description of the subset of the dataset used. May be
a formal title or a list of variables (e.g., concentration, optical density).
Verifier: Used to confirm that two datasets are identical. Most
commonly a UNF or MD5 checksum.
Series: Used if the dataset is part of a series of releases (e.g., monthly)
Contributor: e.g., editor, compiler
Source: http://datapub.cdlib.org/datacitation/
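
The "Verifier" element can be illustrated with an MD5 checksum computed using Python's standard `hashlib` module. This is a sketch under stated assumptions: the function names `md5_of_file` and `same_bytes` and the sample file contents are made up for illustration. It compares files byte for byte, which is not the same as a UNF, since a UNF is computed from the data values rather than from the file's bytes.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True when both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)


# Hypothetical demo files, created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.csv"
    copy = Path(tmp) / "copy.csv"
    edited = Path(tmp) / "edited.csv"
    original.write_bytes(b"id,value\n1,0.5\n")
    copy.write_bytes(b"id,value\n1,0.5\n")
    edited.write_bytes(b"id,value\n1,0.6\n")
    print(same_bytes(original, copy))    # identical content
    print(same_bytes(original, edited))  # one value changed
```

A reader who retrieves a cited dataset can recompute the checksum and compare it with the one published in the citation or repository record.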

Data Citation Examples
Deschenes, Elizabeth Piper, Susan Turner, and Joan Petersilia.
Intensive Community Supervision in Minnesota, 1990-1992: A Dual
Experiment in Prison Diversion and Enhanced Supervised Release.
ICPSR06849-v1. Ann Arbor, MI: Inter-university Consortium for
Political and Social Research [distributor], 2000.
doi:10.3886/ICPSR06849.v1
Esther Duflo; Rohini Pande, 2006, "Dams, Poverty, Public Goods
and Malaria Incidence in India",
http://hdl.handle.net/1902.1/IOJHHXOOLZ
UNF:5:obNHHq1gtV400a4T+Xrp9g== Murray Research Archive
[Distributor] V2 [Version]
Sidlauskas B (2007) Data from: Testing for unequal rates of
morphological diversification in the absence of a detailed
phylogeny: a case study from characiform fishes. Dryad Digital
Repository. doi:10.5061/dryad.20
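
Because each of the citations above carries a persistent identifier, it can be picked out by a program with little effort. The sketch below shows one hedged way to do that with regular expressions; the patterns and the function name `find_identifiers` are illustrative assumptions, not a standard, and the titles in the sample text are shortened.

```python
import re

# Illustrative patterns: a DOI written as "doi:..." or as a doi.org URL,
# and a handle written as an hdl.handle.net URL.
DOI_PATTERN = re.compile(
    r"\b(?:doi:|https?://(?:dx\.)?doi\.org/)(10\.\d{4,9}/\S+)", re.IGNORECASE
)
HANDLE_PATTERN = re.compile(r"https?://hdl\.handle\.net/(\S+)", re.IGNORECASE)


def find_identifiers(text):
    """Return the DOIs and handles found in free text, in order of appearance."""
    strip = ".,;)"  # trailing punctuation that belongs to the sentence
    return {
        "doi": [m.rstrip(strip) for m in DOI_PATTERN.findall(text)],
        "handle": [m.rstrip(strip) for m in HANDLE_PATTERN.findall(text)],
    }


# Shortened versions of the three citations above.
citations = """
Deschenes, Elizabeth Piper, Susan Turner, and Joan Petersilia. Intensive
Community Supervision in Minnesota, 1990-1992. doi:10.3886/ICPSR06849.v1
Esther Duflo; Rohini Pande, 2006, "Dams, Poverty, Public Goods and Malaria
Incidence in India", http://hdl.handle.net/1902.1/IOJHHXOOLZ
Sidlauskas B (2007) Data from: Testing for unequal rates of morphological
diversification. Dryad Digital Repository. doi:10.5061/dryad.20
"""
print(find_identifiers(citations))
```

A data mention with no identifier, such as a dataset name in the acknowledgements, gives a script like this nothing to match, which is why tracking such mentions takes manual search effort.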

Joint Declaration of Data
Citation Principles
1. Future Of Research Communication and
E-Scholarship (FORCE11)
2. Committee on Data for Science and
Technology (CODATA)
3. Digital Curation Centre (DCC)
Source: https://www.force11.org/datacitation

Eight Principles
1. Importance--Data should be considered
legitimate, citable products of research. Data
citations should be accorded the same importance
in the scholarly record as citations of other
research objects, such as publications.
2. Credit and Attribution--Data citations should
facilitate giving scholarly credit and normative and
legal attribution to all contributors to the data,
recognizing that a single style or mechanism of
attribution may not be applicable to all data.

Eight Principles
3. Evidence—In scholarly literature, whenever and
wherever a claim relies upon data, the
corresponding data should be cited.
4. Unique Identification—A data citation should
include a persistent method for identification
that is machine actionable, globally unique, and
widely used by a community.

Eight Principles
5. Access—Data citations should facilitate access to
the data themselves and to such associated
metadata, documentation, code, and other
materials, as are necessary for both humans and
machines to make informed use of the referenced
data.
6. Persistence--Unique identifiers, and metadata
describing the data, and its disposition, should
persist -- even beyond the lifespan of the data they
describe.

Eight Principles
7. Specificity and Verifiability—Data citations
should facilitate identification of, access to, and
verification of the specific data that support a
claim.
Citations or citation metadata should include
information about provenance and fixity
sufficient to facilitate verifying that the specific
timeslice, version and/or granular portion of data
retrieved subsequently is the same as was
originally cited.

Eight Principles
8. Interoperability and flexibility—Data citation
methods should be sufficiently flexible to
accommodate the variant practices among
communities, but should not differ so much that
they compromise interoperability of data citation
practices across communities.

Make Your Data Count
 If it’s not cited, it can’t be counted
 Without counting data use, there is no
accurate way to measure the impact of your
shared data
 Without a well-formed citation, your data
cannot take advantage of the potential of
linked scholarly publishing
 Store your data where citations are unique and
persistent
 Cite your own data and others’ in your
publications

Questions Answered?
 Sharing data—how does it happen?
 What is data publishing?
 Is data archiving the same?
 How can we find data, access it, and reuse it?
 How can we measure the impact of sharing data?
 What’s the common denominator?

Thank you!
Natsuko Nicholls
hayashin@umich.edu
Elizabeth Moss
eammoss@umich.edu
