Data Publishing Models and Workflows

Data Publishing Models
Sünje Dallmeier-Tiessen, PhD
CERN, Harvard University
For the RDA-WDS Data Publishing Workflow Group
June 9th, 2015

Topics
• What is data publishing
• Why do we care about it (today)
• Models in data publishing
• Building blocks
• Information gathered through trusted data publishing
• Relevance and conclusions for today’s workshop
This is work conducted by the RDA-WDS group on data
publishing workflows, chaired in collaboration with Fiona
Murphy and Theo Bloom.

Data Publishing
… describes the process of making research data and
other research objects available on the web so that they
can be discovered and referred to in a unique and
persistent way.
At its best, data publishing takes place through dedicated
data repositories and data journals and ensures that the
published research objects are well documented, curated,
archived for the long term, interoperable, citable and
quality assured.
Thus, they are reusable and discoverable in the long
term.

Analysis elements
• Discipline, responsible units (i.e. their roles)
• Function of workflow
• PID assignment: DOI, ARK, etc.
• Peer review of data (e.g. by researcher & editorial review)
• Curatorial review of metadata (e.g. by institutional or subject repository?)
• Technical review & checks (e.g. for data integrity at repository upon
ingestion)
• Formats covered
• Persons/Roles involved, e.g. editor, publisher, data repository manager,
etc.
• Links to additional data products (data paper; review documents; other
journal articles) or “stand-alone” product
• Links to grants, usage of author PIDs
• Discoverability: Indexing of the data -- if yes, where?
• Data citation facilitated
• Data life cycle reference
• Standards compliance

Simplified generic repository workflow
Producer → Data Deposit → Ingest → Quality Assurance → Data Management → LT Archiving → Dissemination/Access → Consumer/Reuse
• Researcher with a central role during submission/deposition
• Review/QA mainly internal, through dedicated curation personnel
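The simplified generic repository workflow can be read as an ordered pipeline in which internal quality assurance acts as a gate. The following is only a minimal sketch under that reading: the stage names come from the slide, while the `Submission` class, the `passes_quality_assurance` check and its rule are hypothetical and do not describe any particular repository.

```python
from dataclasses import dataclass, field

# Stage names follow the slide; everything else here is a hypothetical illustration.
STAGES = [
    "Data Deposit",
    "Ingest",
    "Quality Assurance",
    "Data Management",
    "LT Archiving",
    "Dissemination/Access",
]


@dataclass
class Submission:
    title: str
    files: list[str]
    completed: list[str] = field(default_factory=list)


def passes_quality_assurance(submission):
    # Assumed rule for illustration: a title and at least one file are required.
    # A real repository would apply its own curation criteria here.
    return bool(submission.title.strip()) and len(submission.files) > 0


def run_workflow(submission):
    """Walk the stages in order and stop at Quality Assurance if the check fails."""
    for stage in STAGES:
        if stage == "Quality Assurance" and not passes_quality_assurance(submission):
            break
        submission.completed.append(stage)
    return submission


accepted = run_workflow(Submission("Example dataset", ["measurements.csv"]))
rejected = run_workflow(Submission("Example dataset", []))
print(accepted.completed)
print(rejected.completed)
```

In this sketch a submission without files stops after "Ingest", mirroring the idea that review happens inside the repository before data are managed, archived and disseminated.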

Disciplinary repository example
Content provided by M. Stockhause
Stages shown: Producer; Data Deposit; Ingest with light Quality Assurance; Ingest with detailed Quality Assurance; Data Management; LT Archiving; Dissemination/Access; Consumer (disciplinary); Consumer (interdisciplinary)
Project Repositories:
• Data are published in a federated data infrastructure
• Data are added and corrected
• Poor documentation
• Usually no data backup
• Light-weight quality assurance against international and project standards
• Project data tend never to become stable
• Currently no PIDs assigned or reserved, but Handles are planned
Long-term Archive:
• Data are archived for the long term at a single location
• Data are stable and curated
• Detailed documentation
• Data backup/redundancy
• Quality assurance process is more detailed and includes a review
• Data are a “snapshot” of the project data at a certain time
• DOIs assigned to data collections

Lessons learnt and questions
• Very diverse landscape
• Discipline-specific and cross-discipline actions
• Quality assurance a big topic in discipline-specific
repositories
• Widespread persistent identification
• Data citation awareness
• Challenge: Versioning

Simplified generic publisher workflow
Stages shown: Producer; Article preparation; Data Submission; Article submission; Peer Review Process; Editing; Publishing; Data repositories; Consumer/Reuse
• Researcher takes over several roles: submitter, reviewer, potentially editor?
• Two packaging options: article/data container, or separate article and datasets

Example Workflows in Dataverse:
Connect Data to Journals
A. Journals include Dataverse as a Recommended Repository
B. Authors Contribute Directly to a Journal’s Dataverse
C. Automated Integration of Journal + Dataverse (e.g., OJS)
Slide by Eleni Castro

Example: Dryad repository integrated
with journals
Slide by T. Bloom

Data publishing building blocks
Basic published product:
• Primary data entry with PID
– Repository entry
– Metadata
– Curation
Add-ons (workflows for more documentation, QA, visibility):
• Parallel data description
– Data Paper or link to it
– Link to results paper
• Linked and published quality assurance
– Curation, editing process
– Peer review
– Any kind of QA process
• Additional visibility
– Push to ORCID, author pages, impact/reputation-building tools
– Enable indexing (Data Citation Index, crawled by Google)

Trusted data publishing contains:
• Standardized information about the data
– Disciplinary standards
– Basic common metadata sets
• Distinct Roles, Workflows and Responsibilities
– Authorship, Submission
– Curation
– Quality Assurance
– Peer review
• Persistent Identification
– Permanent reference
– Data citation

Challenges
• Interoperability challenges
– Different metadata schemas
– Rich vs. limited metadata
• Discoverability challenges
– E.g. no bi-directional linking
– Usability challenges in aggregators
• Metrics and accreditation
• What information is needed for future
reuse/remix/reproducibility?
• How can this information be exposed in both
human- and machine-readable form?
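One of the discoverability challenges listed above, missing bi-directional linking, is something that could in principle be checked mechanically. The sketch below assumes links are available as a simple mapping from one identifier to the identifiers it points to; the identifiers and the `find_one_way_links` helper are hypothetical.

```python
def find_one_way_links(links):
    """Return (source, target) pairs where the target does not link back to the source."""
    missing = []
    for source, targets in links.items():
        for target in targets:
            if source not in links.get(target, set()):
                missing.append((source, target))
    return sorted(missing)


# Hypothetical identifiers: one article citing two datasets,
# only one of which points back to the article.
links = {
    "article:A": {"dataset:D1", "dataset:D2"},
    "dataset:D1": {"article:A"},
    "dataset:D2": set(),
}
print(find_one_way_links(links))
# prints: [('article:A', 'dataset:D2')]
```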

Data Publishing Workflows
Activities and processes in a digital environment
that lead to the publication of research data and
other research objects on the Web. These
activities may be performed by humans or in an
automated fashion.
In contrast to the interim or final published
products, workflows are the means to curate,
document, peer review and thus ensure and
enhance the value of the published product.
