How can we ensure research data is re-usable? The role of Publishers in Research Data Management, by Catriona MacCallum. 2nd LEARN Workshop, Vienna, 6th April 2016
1. How can we ensure that data is reusable?
The role of Publishers in Research Data Management
2nd LEARN Workshop, Vienna, April 2016
Catriona MacCallum, Senior Advocacy Manager, PLOS; Consulting Editor, PLOS ONE
Member of the Boards of OASPA and OpenAIRE
3. PLOS ONE
• Multi-disciplinary
• Online only
• Open access (CC BY)
• Large, independent editorial board (>6000)
• Manuscripts assessed only on the rigour of the science, not the novelty or scope of the topic
• Enables publication of negative/inconclusive results (and data)
7. Data Availability
The probability of finding the data associated with a paper declined by 17% every year (a rough worked illustration follows the citation).
Vines, Timothy H., et al. “The Availability of Research Data Declines Rapidly with Article Age.” Current Biology 24, no. 1 (6 January 2014): 94–97. doi:10.1016/j.cub.2013.11.014.
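To get a feel for what a 17% annual decline implies over an article's lifetime, here is a minimal sketch. It assumes the decline simply compounds year on year, which is a simplification of the study's odds-based estimate.

```python
# Rough illustration of a 17%-per-year decline, assuming simple annual
# compounding (a simplification of the study's odds-based model).
ANNUAL_DECLINE = 0.17

for years in (2, 5, 10, 20):
    remaining = (1 - ANNUAL_DECLINE) ** years
    print(f"After {years:2d} years: ~{remaining:.0%} of the original chance of finding the data")
```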
8. What are publishers doing…?
Summary data in the paper
• In tables and figures within the final published article
• Data used to compile figures and tables generally not provided
• In supplementary material
Archiving
• Held by author, journal
• In institutional or other data repositories, at author’s discretion
A plethora of policies and non-policies
• Generally journal specific
• Sometimes publisher specific
• Generally not enforced
9. Major Publisher Policies
Wiley, Springer Nature, Taylor & Francis
• Partnerships with Figshare, Dryad
• Sharing depends on journal policy
• Where required, enforcement by journal editors
Elsevier
• Encourages data sharing but no explicit partnerships
• Reuse depends on licence of article
• Testing which licences should be applied
• Open Data Pilot
• Hosting research data on ScienceDirect (CC BY)
Society Publishers?
• Journal/Discipline specific
11. Licences
Data in papers is subject to the same copyright and licence as the paper
• Subscription journals restrict access
• Some journals have a different licence for supplementary information (Nature?)
Open Access licences vary
• Many hybrid articles restrict commercial re-use
• Most OA publishers apply CC BY
Bespoke licences
• STM Association licences
• Repository specific licences
Mostly incompatible with Text & Data Mining
12. PLOS data policy
PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.
When submitting a manuscript online, authors must provide a Data Availability Statement describing compliance with PLOS's policy. If the article is accepted for publication, the data availability statement will be published as part of the final article.
Since March 3, 2014
13. DAS (Data Availability Statement)
NB: The DAS is openly available and machine-readable as part of the PLOS search API (a query sketch follows).
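As a minimal sketch of what "machine-readable via the PLOS search API" can mean in practice, the query below uses the public Solr-style endpoint at api.plos.org. The endpoint and the generic parameters (q, fl, wt, rows) are real; the field name `data_availability` for the DAS is an assumption, so check the API's field list before relying on it.

```python
# Sketch: pull Data Availability Statements via the PLOS search API.
# Endpoint and Solr-style parameters are real; the DAS field name
# ("data_availability") is an assumption -- verify against the API docs.
import requests

params = {
    "q": 'journal:"PLOS ONE"',           # Solr query; journal field assumed available
    "fl": "id,title,data_availability",  # field list; DAS field name assumed
    "wt": "json",
    "rows": 5,
}
resp = requests.get("http://api.plos.org/search", params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"), "-", doc.get("data_availability", "<no DAS field returned>"))
```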
15. Data Availability
PLOS Data Availability Policy
Define compliance
In 2015: ~95% of PLOS ONE papers have a Data Availability Statement
But what is true compliance?
Tim Vines, Richard Van Noorden
16. Anecdotes & Interpretation
‘Mandated data archiving greatly improves access to research data’, T. H. Vines et al., FASEB J 27, 1304–1308 (Jan 2013)
Source: ‘Confusion over publisher’s pioneering open-data rules’, Nature 515, 478 (27 November 2014). doi:10.1038/515478a
50 fMRI studies in PLOS ONE¹: 38 had shared the data, 12 had not (completely anecdotal).
An increase in data sharing²: from 12% to 40%, and even up to as much as 76%.
Not seeing full compliance, but we are seeing a MASSIVE improvement.
17. Where are the Data (PLOS)

Time       | Papers with DAS | Data in Submission Files (#) | Data in Submission Files (%) | Data in Repositories (estimate) | Data upon Request (estimate)
Q2-Q4 2014 | 9491            | 7918                         | 74%                          | 11%                             | 10%
Q2-Q4 2015 | 22142           | 15382                        | 69%                          | 14%                             | 12%

Repository counts:
               | Dryad        | Figshare     | NCBI        | GitHub
Q2-Q4 2014     | 152          | 210          | 551         | 37
Q2-Q4 2015     | 551          | 753          | 1229        | 174
Percent change | 50% increase | 54% increase | 8% decrease | <1% (consistent)

DAS = Data Availability Statement
18. Data checks on PLOS ONE
Contractor (‘Editorial Office’) does an initial check:
• Flags any instance of not being able to share the data publicly
• Sends the author a detailed request (template) with the decision letter
• Escalates further concerns to PLOS staff
14 Publication Assistants work on escalated data-related issues:
• During peer review and final checks
• Currently amounts to two full-time staff
• Problem papers escalated to internal editorial staff
  • 1-2 a month (but time intensive)
• Post-publication concerns raised by readers
  • So far 36 corrections
  • 4 republications due to patient-identifying information
19. Major issues (PLOS)
• Most papers say data are within the paper and SI files, but...
• Patient-identifying info: we check all clinical SI files on acceptance, but still too many papers reveal information
• Cohort/consortia/multi-institutional and multi-national studies:
  • Many have steering committees that do not permit public deposition of the data
  • Restrictions are not always for ethical or legal reasons, and some of these groups also require authorship for access to data
20. PLOS Data Policy
OTHER ISSUES ENCOUNTERED ALONG THE WAY
• Concern re early sharing & scooping.
• How much data checking should editors/reviewers do?
• Which data are actually required?
• Lack of or inconsistent community standards
• Which repositories?
• Un-extractable data, proprietary file-types.
• Tension between patient privacy issues and data sharing
• Fibbing authors
• Field-specific differences about what is acceptable
21. Challenges
QUESTIONS WE DON’T KNOW ANSWERS TO YET
• Treatment of software/code
• How should materials sharing differ?
• What to do with big data?
• Do we need better/more aligned consenting for patient studies?
• Best practices for data access committees?
• How to fund data access committees?
• Preservation of obsolete formats?
• How to cite data & credit data reuse?
Michael Carroll (2015) Sharing Research Data and Intellectual Property Law: A Primer. PLOS Biology. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002235
23. On January 7, 2016, a coalition of publishers signed an Open Letter committing to start requiring ORCID iDs in 2016:
1. Implement best practices for ORCID collection and auto-update of ORCID records upon publication (see the sketch after this list)
2. Require ORCID iDs for corresponding authors and encourage them for co-authors
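As a hedged sketch of what machine-readable ORCID records enable, the snippet below reads the public works attached to an ORCID iD. The public API host is real; the version segment in the path changes over time, and the iD used is ORCID's fictitious documentation example (Josiah Carberry), not a real researcher.

```python
# Sketch: list works attached to a public ORCID record.
# pub.orcid.org is ORCID's public API; the "/v2.0/" version segment may
# differ over time, and 0000-0002-1825-0097 is ORCID's fictitious example iD.
import requests

orcid_id = "0000-0002-1825-0097"
url = f"https://pub.orcid.org/v2.0/{orcid_id}/works"
resp = requests.get(url, headers={"Accept": "application/json"}, timeout=30)
resp.raise_for_status()
for group in resp.json().get("group", []):
    summary = group["work-summary"][0]   # first summary per work group
    print(summary["title"]["title"]["value"])
```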
24. CRediT – Contributor Roles Taxonomy
A simple taxonomy of research contributions developed under the auspices of CASRAI and NISO.
- Includes but is not limited to traditional authorship roles
- Makes contributions machine-readable and portable
- Meant to inspire development: Mozilla badges, VIVO-ISF ontology, JATS integration, ORCID integration
25. The CRediT taxonomy is by design simple, which may become limiting, but it provides an important framework for authorship discussions (a machine-readable sketch follows the list below).
Ideal solution:
* includes a free text field for each contribution
* can be used upstream from submission, during research
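A minimal sketch of what "machine-readable and portable" contributions could look like. The 14 role names come from the CRediT taxonomy; the data structure, the free-text `detail` field (mirroring the "ideal solution" above) and the example ORCID iD are illustrative assumptions, not a PLOS or JATS schema.

```python
# Sketch: author contributions recorded against the CRediT taxonomy.
# Role names are the 14 CRediT roles; the structure, the free-text
# "detail" field and the example iD are illustrative assumptions.
CREDIT_ROLES = {
    "Conceptualization", "Data curation", "Formal analysis",
    "Funding acquisition", "Investigation", "Methodology",
    "Project administration", "Resources", "Software", "Supervision",
    "Validation", "Visualization",
    "Writing - original draft", "Writing - review & editing",
}

contributions = [
    {"orcid": "0000-0002-1825-0097",   # ORCID's fictitious example iD
     "role": "Data curation",
     "detail": "Cleaned and deposited the underlying dataset."},
    {"orcid": "0000-0002-1825-0097",
     "role": "Writing - original draft",
     "detail": ""},
]

for c in contributions:
    assert c["role"] in CREDIT_ROLES, f"Unknown CRediT role: {c['role']}"
    print(f"{c['orcid']}: {c['role']}" + (f" ({c['detail']})" if c["detail"] else ""))
```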
26. Persistent identifiers and metadata
• Data citation not standard practice
• Inability to link data to papers
• No separate identifiers for figures, tables, supplementary material, etc.
• Low adoption of persistent identifiers for Researchers
• Persistent identifiers for Funders & Institutions in flux
27. The ecosystem of persistent identifiers is growing
Contributions in a machine-readable format can enrich this ecosystem (a DOI-resolution sketch follows).
[Diagram: articles, datasets and contributions linked via persistent identifiers (DOIs, accession numbers)]
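As a small illustration of how the identifier ecosystem becomes machine-actionable, the snippet below resolves an article DOI to its metadata via the Crossref REST API. The endpoint is real; the DOI is one already cited in these slides (Head et al., PLOS Biology 2015), and the exact shape of the returned record can vary.

```python
# Sketch: resolve a DOI to machine-readable metadata via the Crossref REST API.
# The endpoint is real; the DOI is one cited elsewhere in this deck.
import requests

doi = "10.1371/journal.pbio.1002106"   # Head et al. (2015), cited on a later slide
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]
print(work["title"][0])
print(", ".join(a.get("family", "?") for a in work.get("author", [])))
```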
28. OPEN CITATIONS will create services for authors
e.g. linking EU PMC’s Open Citations to an ORCID iD
30. THOR
• EU-funded project (~€2 million)
• A consortium of partners: British Library, ORCID, DataCite, CERN, EMBL-EBI, PANGAEA, Australian National Data Service, Dryad, PLOS, Elsevier
• Project duration is 30 months (~June 2015 to Dec 2017)
http://project-thor.eu/
31. Data Citation (1): credit for data producers and collectors
• Should comply with the FORCE11 Data Citation Principles
• Minimum requirements (a formatting sketch follows this slide):
  • Author names, repository name, date + a persistent unique identifier (such as a DOI or URI)
  • The citation should link to the dataset directly via the persistent identifier
  • Comprehensive, machine-readable landing pages for deposited data
  • Guidance to authors to include data in references
https://www.force11.org/group/joint-declaration-data-citation-principles-final
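A minimal sketch of a citation that meets the minimum requirements listed above (author names, repository, date, and a persistent identifier that links directly to the dataset). The formatting function and the example record are illustrative assumptions, not a prescribed FORCE11 or PLOS reference style.

```python
# Sketch: assemble a data citation covering the minimum requirements above.
# The function and the example record are illustrative, not a fixed style.
def format_data_citation(authors, year, title, repository, doi):
    """Return a citation string whose identifier resolves to the dataset."""
    return f"{', '.join(authors)} ({year}). {title}. {repository}. https://doi.org/{doi}"

print(format_data_citation(
    authors=["Smith J", "Jones A"],                   # hypothetical depositors
    year=2015,
    title="Field measurements underlying Figure 2",   # hypothetical dataset title
    repository="Dryad Digital Repository",
    doi="10.5061/dryad.xxxxx",                        # placeholder DOI, not a real deposit
))
```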
32. Data Citation (2): challenges
• Only ‘approved’ repositories?
• In main reference list or separate?
• Distinguish between data produced in the study versus reuse of data produced elsewhere?
• Datasets that are continuously updated
• Data citations not tagged by publishers (JATS)
• Data citation metrics
• Software citation
34. Protocols.io
• Database of experimental protocols
• Open access and free for users
• Desktop and mobile applications
• Functionality to
• Create
• Fork – create derivatives (keeps provenance)
• Run
• Annotate while running
• Keep date-stamped version of actual run
• Export to PDF, etc
10k registrants
1,000 private protocols
38. Retractions
• Retraction of a research article is a complete and permanent removal of the article from the scientific record
• Although a retracted article remains accessible to readers, it should no longer be cited
• Reasons for retraction:
• Conclusions cannot be relied upon and are no longer
supported (invalid results)
• Serious breach of research or publication ethics
39. Why are papers retracted?
Van Noorden, Nature 478, 26-28 (2011)
40. September 2009
Elizabeth Wager, Virginia Barbour, Steven Yentis, Sabine Kleinert, on behalf of COPE Council:
“The main purpose of retractions is to correct the literature and ensure its integrity rather than to punish authors who misbehave.”
41. Process and best practices
• Retractions have their own DOI.
• Permanent bi-directional linking and clear marking of the paper; syndicate to indexers.
• For corrections, clear indication of whether the paper was republished.
CrossRef industry-wide initiative (CrossMark) to provide a standard way for readers to locate the most up-to-date version of an article. PLOS has adopted it.
What about data retractions?
42. Editorial and peer review evaluation
Editorial office:
• Trial registration
• Data deposition
• Reporting guidelines
• Ethical approval
• Competing interests
• Financial disclosures
• Permissions
• Plagiarism
• Image integrity
Peer reviewers:
• Methodology and experimental design
• Analysis
• Statistics
• Conclusions
• Ethics
Limitations:
- Science has become more cross-disciplinary
- Confidential peer review can show biases
- Paper should live on after publication
- Still not enough incentives to publish all results
43. False expectations
Peer review is expected to police the literature but:
• Science has become more cross-disciplinary and more complicated (mammoth datasets)
• Is 2 or 3 reviewers + 1 editor sufficient?
• Anonymity conceals/engenders negativity and bias
• No incentive/reward for constructive collaboration
• Reviewers review for journals and editors, not for readers, colleagues or society
• Peer review is a black box; impossible to assess its effectiveness
44. Is science reliable?
• Poorly designed studies
  • small sample sizes; lack of randomisation, blinding and controls
• Data not available to scrutinise/replicate
• ‘p-hacking’ (selective reporting) widespread¹
• Poorly reported methods & results²
• Negative results are not published
¹ Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD (2015) The Extent and Consequences of P-Hacking in Science. PLoS Biol 13(3): e1002106. doi:10.1371/journal.pbio.1002106
² Landis SC, et al. (2012) A call for transparent reporting to optimize the predictive value of preclinical research. Nature 490(7419): 187–191.
46. Does prestige ensure ‘quality’?
• Higher-ranked journals have more papers retracted¹
• Papers in higher-ranked journals are more likely to report either no or inappropriate statistics²,³
• Papers from highly ranked institutions have poorer reporting standards³
¹ Fang, Ferric C., and Arturo Casadevall. “Retracted Science and the Retraction Index.” Infection and Immunity 79, no. 10 (October 1, 2011): 3855–59. doi:10.1128/IAI.05661-11.
² Tressoldi PE, Giofre D, Sella F, Cumming G. High impact = high statistical standards? Not necessarily so. PLoS One 2013; 8(2): e56180. doi:10.1371/journal.pone.0056180. PMID: 23418533
³ Macleod MR, et al. (2015) Risk of Bias in Reports of In Vivo Research: A Focus for Improvement. PLoS Biol 13(10): e1002273. doi:10.1371/journal.pbio.1002273
47. “Current incentive structures in science are likely to lead rational scientists to adopt an approach to maximise their career advancement that is to the detriment of the advancement of scientific knowledge.”
Andrew Higginson and Marcus Munafò, in prep (cited with their permission)
48. • Researchers gain from publishing in ‘designer’ journals
• Journals gain financially from their brand / Journal Impact Factor
• Institutions gain financially by hiring and firing based on where researchers publish, not on what they publish (or the mission of the university)
• Research assessment by funders is often based on very few publications and brand/impact factor (some are changing)
49. Declaration on Research Assessment (DORA)
• A worldwide initiative, spearheaded by the ASCB (American Society for Cell Biology), together with scholarly journals and funders
• Focuses on the need to improve the way in which the outputs of scientific research are evaluated:
  • the need to eliminate the use of journal-based metrics, such as Journal Impact Factors, in funding, appointment, and promotion considerations;
  • “the need to assess research on its own merits rather than on the basis of the journal in which the research is published”
54. By the time a paper is submitted to a journal, it is generally too late.
55. • Lack of incentives for authors to share data and software, or to transparently report details of data, methods, etc. within articles
• Different data-sharing expectations among co-authors in receipt of grants from different funders, locations or disciplines
• The absence of a culture, and lack of education, within institutions to put data management and archiving at the centre of good lab practice
• Lack of any coherent infrastructure (e.g. repositories, metadata standards)
• Licensing chaos (implications for Text and Data Mining)
• No clear definition of what ‘the data underlying the paper’ means
• Lack of enforcement by different stakeholders in the chain (funders, institutions, publishers)
• No means of reporting compliance
56. • Align policies between funders, publishers, institutions
• Reduce the burden on researchers
• Incentivise all players (sticks and carrots)
• Monitor progress towards common goals
• Create global community standards for open science
• Define ‘Open Science’
• COPE, TOP guidelines, Leiden Manifesto, HEFCE report on metrics
• Build the infrastructure to support open science
• Interoperable, publicly available platforms
• New submission and reviewing tools that foster openness, transparency and collaboration
• The means to track, link and assign credit to all types of outputs
• Persistent identifiers for researchers, funders, institutions, licences, etc. (ORCID, FundRef, DataCite DOIs for data, etc.)
• Apply the scientific method to scholarly communication itself
• ‘Evidence-based’ policy
• Publicly available data on metrics, indicators, evaluation
• Independent scrutiny