Data publishing is becoming an integral part of scholarly communication today. Thus, it is indispensable to understand how data publishing works across disciplines. Are there best practices others can learn from, or even data publishing standards? How do they impact interoperability in the Open Science landscape? The presentation will look at a range of examples and at the main building blocks of data publishing today. The work has been conducted as part of the RDA Data Publishing Workflows group.
1. Data Publishing Models
Sünje Dallmeier-Tiessen, PhD
CERN, Harvard University
For the RDA-WDS Data Publishing Workflow Group
June 9th, 2015
2. Topics
• What is data publishing
• Why do we care about it (today)
• Models in data publishing
• Building blocks
• Information gathered through trusted data publishing
• Relevance and conclusions for today’s workshop
This is work conducted by the RDA-WDS group on data
publishing workflows, chaired in collaboration with Fiona
Murphy and Theo Bloom.
3. Data Publishing
… describes the process of making research data and
other research objects available on the web so that they
can be discovered and referred to in a unique and
persistent way.
At its best, data publishing takes place through dedicated
data repositories and data journals and ensures that the
published research objects are well documented, curated,
archived for the long term, interoperable, citable and
quality assured.
Thus, they are reusable and discoverable in the long
term.
8. Analysis elements
• Discipline, responsible units (i.e. their roles)
• Function of workflow
• PID assignment: DOI, ARK, etc.
• Peer review of data (e.g. by researcher & editorial review)
• Curatorial review of metadata (e.g. by institutional or subject repository)
• Technical review & checks (e.g. for data integrity at repository upon
ingestion)
• Formats covered
• Persons/Roles involved, e.g. editor, publisher, data repository manager,
etc.
• Links to additional data products (data paper; review documents; other
journal articles) or “stand-alone” product
• Links to grants, usage of author PIDs
• Discoverability: Indexing of the data -- if yes, where?
• Data citation facilitated
• Data life cycle reference
• Standards compliance
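The analysis elements above can be read as the fields of one record per surveyed workflow. The following is a minimal sketch of such a record, assuming Python dataclasses; the class name, field names and example values are illustrative and are not taken from the group's actual analysis template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkflowProfile:
    """One surveyed data publishing workflow (illustrative field names)."""
    name: str
    discipline: str
    pid_types: List[str] = field(default_factory=list)   # e.g. ["DOI"] or ["ARK"]
    peer_review: bool = False           # peer review of the data
    curatorial_review: bool = False     # curatorial review of the metadata
    technical_checks: bool = False      # e.g. integrity checks on ingestion
    formats: List[str] = field(default_factory=list)
    linked_products: List[str] = field(default_factory=list)  # data paper, article
    indexed_in: List[str] = field(default_factory=list)       # where data is indexed
    citation_facilitated: bool = False

    def review_layers(self) -> List[str]:
        """Return the kinds of review this workflow applies, in a fixed order."""
        layers = []
        if self.peer_review:
            layers.append("peer review")
        if self.curatorial_review:
            layers.append("curatorial review")
        if self.technical_checks:
            layers.append("technical checks")
        return layers


# Hypothetical example entry, not a real surveyed workflow.
profile = WorkflowProfile(
    name="Example archive",
    discipline="climate science",
    pid_types=["DOI"],
    curatorial_review=True,
    technical_checks=True,
)
print(profile.review_layers())
```

Keeping every element as an explicit field makes workflows from different disciplines directly comparable, which is the purpose of the analysis.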
11. Disciplinary repository example
[Diagram labels: Producer; Data Deposit; Ingest; Quality Assurance (light); Data Management; LT Archiving; Dissemination; Access; Consumer (disciplinary); Ingest; Quality Assurance (detailed); Dissemination; Access; Consumer (interdisciplinary)]
Project Repositories:
• Data are published in a federated data infrastructure
• Data are added and corrected
• Poor documentation
• Usually no data backup
• Light-weight quality assurance against intl. and project standards
• Tendency that the project data never become stable
• Currently no PIDs assigned or reserved but Handles planned
Long-term Archive:
• Data are archived for the long term at a single location
• Data are stable and curated
• Detailed documentation
• Data backup/redundancy
• Quality assurance process is more detailed and includes a review
• Data is a “snapshot” of the project data at a certain time
• DOIs assigned to data collections
Content provided by M. Stockhause
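The contrast between changing project data and a stable archived "snapshot" can be illustrated with checksums. This is a minimal sketch, assuming SHA-256 fixity values; it is not the procedure of the repository described on this slide, and the function and file names are hypothetical.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """Record one checksum per file: the 'snapshot' at a certain time."""
    return {name: sha256_hex(content) for name, content in sorted(files.items())}


def changed_files(manifest: Dict[str, str], files: Dict[str, bytes]) -> List[str]:
    """List files that were added, removed or modified since the snapshot."""
    current = build_manifest(files)
    names = set(manifest) | set(current)
    return sorted(n for n in names if manifest.get(n) != current.get(n))


# Hypothetical project data at snapshot time.
project_data = {"run1.csv": b"t,temp\n0,14.1\n", "readme.txt": b"draft notes"}
snapshot = build_manifest(project_data)

# The project repository keeps changing after the snapshot was taken.
project_data["run1.csv"] = b"t,temp\n0,14.2\n"   # corrected value
project_data["run2.csv"] = b"t,temp\n0,15.0\n"   # newly added file

print(changed_files(snapshot, project_data))
```

An archive that stores such a manifest alongside the data can later show that the archived collection still matches what was reviewed.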
12. Lessons learnt and questions
• Very diverse landscape
• Discipline-specific and cross-discipline actions
• Quality assurance a big topic in discipline-specific
repositories
• Widespread persistent identification
• Data citation awareness
• Challenge: Versioning
15. Example Workflows in Dataverse:
Connect Data to Journals
A. Journals include Dataverse as a Recommended Repository
B. Authors Contribute Directly to a Journal’s Dataverse
C. Automated Integration of Journal + Dataverse (e.g., OJS)
Slide by Eleni Castro
17. Data publishing building blocks
Basic published product:
• Primary data entry with PID
• Repository entry
• Metadata
• Curation
Add-ons: workflows for more documentation, QA, visibility
• Parallel data description
– Data Paper or link to it
– Link to results paper
• Linked and published quality assurance
– Curation, Editing process
– Peer review
– Any kind of QA process
• Additional visibility
– Push to ORCID, author pages, impact/reputation building tools
– Enable index (Data citation index, crawled by Google)
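The split between a basic published product and optional add-ons can be sketched as a small data model. This is one possible reading of the slide, assuming three add-on categories; all names, the DOI and the URL are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class AddOn(Enum):
    """Add-on categories named on the slide."""
    DOCUMENTATION = "parallel data description"
    QUALITY_ASSURANCE = "linked and published quality assurance"
    VISIBILITY = "additional visibility"


@dataclass
class PublishedProduct:
    """Basic published product: PID, repository entry and metadata."""
    pid: str
    repository_entry: str
    metadata: Dict[str, str]
    add_ons: Dict[AddOn, List[str]] = field(default_factory=dict)

    def attach(self, kind: AddOn, item: str) -> None:
        """Attach one add-on item (e.g. a data paper link) to the product."""
        self.add_ons.setdefault(kind, []).append(item)

    def missing_add_ons(self) -> List[AddOn]:
        """Return the add-on categories that have no item yet."""
        return [kind for kind in AddOn if kind not in self.add_ons]


# Placeholder identifiers for illustration only.
product = PublishedProduct(
    pid="doi:10.1234/example",
    repository_entry="https://repository.example.org/record/1",
    metadata={"title": "Example dataset"},
)
product.attach(AddOn.DOCUMENTATION, "link to data paper")
product.attach(AddOn.VISIBILITY, "push to ORCID")
print([kind.value for kind in product.missing_add_ons()])
```

The basic product is complete without any add-on; the add-ons only enrich it, which mirrors the layering on the slide.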
18. Trusted data publishing contains:
• Standardized information about the data
– Disciplinary standards
– Basic common metadata sets
• Distinct Roles, Workflows and Responsibilities
– Authorship, Submission
– Curation
– Quality Assurance
– Peer review
• Persistent Identification
– Permanent reference
– Data citation
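Persistent identification and data citation fit together: once a dataset has a DOI, a citation can be generated from basic metadata. The sketch below assumes a common "Creator (Year): Title. Publisher. Identifier" pattern; the function names and all example values, including the DOI, are placeholders.

```python
from typing import List


def doi_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable URL via the doi.org resolver."""
    return "https://doi.org/" + doi


def format_citation(creators: List[str], year: int, title: str,
                    publisher: str, doi: str) -> str:
    """Build a data citation string from basic common metadata."""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. {doi_url(doi)}"


# Placeholder metadata, not a real dataset.
citation = format_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2015,
    title="Example dataset",
    publisher="Example Repository",
    doi="10.1234/example",
)
print(citation)
```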
19. Challenges
• Interoperability challenges
– Different metadata schemas
– Rich vs. limited metadata
• Discoverability challenges
– E.g. no bi-directional linking
– Usability challenges in aggregators
• Metrics and accreditation
• What information is needed for future
reuse/remix/reproducibility?
• How can this information be exposed in both human-
and machine-readable form?
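One way to expose the same information to humans and machines is to keep a single metadata record and render it twice. The sketch below assumes schema.org's Dataset type serialised as JSON-LD for the machine-readable side; this is an illustrative choice, not a recommendation from the working group, and the example values are placeholders.

```python
import json
from typing import Dict, List


def to_jsonld(title: str, doi: str, creators: List[str], description: str) -> Dict:
    """Machine-readable view: a schema.org Dataset description as JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "identifier": "https://doi.org/" + doi,
        "creator": [{"@type": "Person", "name": c} for c in creators],
        "description": description,
    }


def to_text(record: Dict) -> str:
    """Human-readable view rendered from the same record."""
    names = ", ".join(c["name"] for c in record["creator"])
    return f"{record['name']} by {names} ({record['identifier']})"


# Placeholder values for illustration.
record = to_jsonld(
    title="Example dataset",
    doi="10.1234/example",
    creators=["J. Doe"],
    description="Hourly temperature readings from a hypothetical station.",
)
print(json.dumps(record, indent=2))
print(to_text(record))
```

Because both views derive from one record, they cannot drift apart, which addresses part of the interoperability concern raised above.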
21. Data Publishing Workflows
Activities and processes in a digital environment
that lead to the publication of research data and
other research objects on the Web. These
activities may be performed by humans or in an
automated fashion.
In contrast to the interim or final published
products, workflows are the means to curate,
document, peer review and thus ensure and
enhance the value of the published product.
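The definition above, activities performed by humans or in an automated fashion that lead to a published product, can be sketched as an ordered list of steps. This is a minimal illustration; the step names, the required metadata fields and the DOI are assumptions, and the human step is only simulated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass
class Step:
    """One workflow activity, performed by a human or automatically."""
    name: str
    automated: bool
    run: Callable[[Record], Record]


def check_metadata(record: Record) -> Record:
    """Automated check: required fields (an assumed minimal set) are present."""
    missing = [key for key in ("title", "creator") if not record.get(key)]
    if missing:
        raise ValueError(f"missing metadata: {missing}")
    return record


def curate(record: Record) -> Record:
    """Stand-in for a human curator's review; here it only sets a flag."""
    return {**record, "curated": True}


def assign_pid(record: Record) -> Record:
    """Automated PID assignment, using a placeholder DOI."""
    return {**record, "pid": "doi:10.1234/example"}


def run_workflow(record: Record, steps: List[Step]) -> Tuple[Record, List[str]]:
    """Apply each step in order and log who performed it."""
    log = []
    for step in steps:
        record = step.run(record)
        log.append(f"{step.name} ({'automated' if step.automated else 'human'})")
    return record, log


steps = [
    Step("metadata check", True, check_metadata),
    Step("curation", False, curate),
    Step("PID assignment", True, assign_pid),
]
published, log = run_workflow({"title": "Example dataset", "creator": "J. Doe"}, steps)
print(published)
print(log)
```

The log keeps the workflow separate from the published product, in line with the distinction drawn in the definition.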