A presentation on research data management tools, workflows and best practices at Imperial College London with a focus on software management. Presented at the 2017 session of the HPC Summer School (Dept. of Computing).
1. Library Services
Research Data (and Software) Management at Imperial*
(*Everything you need to know to gain impact for your work!)
HPC Summer School
Research Data Management
Community Session
Sept. 20th, 2017
Sarah Stewart, Research Data Management Team,
Central Library
2. Outline
1.) What is RDM and how does it apply to software?
2.) The RDM workflow: Tools and processes at Imperial
3.) Plan: Software Management
4.) Store and Archive: the importance of metadata
5.) Publishing and Discovery: Metadata strikes again!
6.) Conclusion/Questions
3. What is Research Data Management?
Research Data Management is part of the research
process, and aims to make the research process as
efficient as possible, and meet expectations and
requirements of the university, research funders,
and legislation.
4. Funder requirements…
“Publicly funded research data are a public good,
produced in the public interest, which should be made
openly available with as few restrictions as
possible…”
RCUK Common Principles on Data Policy
5. The Strong Case for RDM
• Intensive Data-Generating Research Hubs = ‘Big Data’
• UK Med Bio - Bioinformatics Data Science Group – research into causes
and progression of human diseases.
• NHS Trust Research Data (Medicine)
• Research Computing Group and Research Software Engineering
Community
• But also many important ‘small data’ projects across College.
6. Why spend time on RDM?
• It is not a distraction from ‘real work’.
• You can work effectively and efficiently.
• Save time and reduce frustration in the future.
• Set up systems that work for you.
7. Missing Data (and Software)
“In their parents' attic, in boxes in the garage, or stored on now-defunct
floppy disks — these are just some of the inaccessible places in which
scientists have admitted to keeping their old research data.”
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
9. Software: What do I do with it?
• Lots of emphasis on ‘data’ management, but software in
research is often neglected.
• Software is sensitive to changes in its ‘environment’
• There is a lot of variation inherent in software
(languages, versions, licensing, etc.)
11. Software as ‘Data’
• ‘Software is used to create, interpret, present,
manipulate and manage data’ (Software Sustainability
Institute)
• Data: ‘recorded factual material commonly retained by
and accepted…as necessary to validate research
findings’ (EPSRC)
• Software = Data!
12. Treat software as valuable research output
• PyRDM Green Shoots project
• Zenodo integrates with GitHub
• College survey on distributed version control
• Software Sustainability Institute Fellowship
17. PLAN - Data Management Plans
A Data Management Plan is a document that is created in the early
stages of a project that:
• Helps you consider all aspects up front
• Should be useful for you
• Should be kept up to date
An initial plan may be expanded later but should provide details about:
• Plans and expectations for data
• The nature of data and its creation or acquisition
• Storage and security
• Preservation and sharing
19. Software Management Plans
What?
• Like Data Management Plans, Software management
plans provide an outline of uses, responsibilities,
ownership, access and sharing, storage, maintenance
and archiving of research software.
20. Software Management Plans
Why?
• No clear funder requirements yet, but…
• Promotes citability and credit for your research =
Increased Research Impact
• Research Output can be validated/checked by others
• Supports transparency of research and promotes Open
Research.
• Good practice!
21. Software Management Plans
How? (at Imperial College London):
• Specialised template in DMPOnline (via DCC)
• Imperial-specific DMPOnline template (in development).
• Use GitHub (Imperial has an enterprise account)
• Use Zenodo or another subject-specific repository to archive
versions of research software (GitHub integration)
• Log metadata about your software into Symplectic.
• Contact the RDM Team (Central Library) for assistance/support:
rdm-enquiries@imperial.ac.uk
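The GitHub-to-Zenodo route above can be sketched concretely: Zenodo’s GitHub integration archives each tagged release and mints a DOI for it, and a `.zenodo.json` file in the repository root can supply the deposit metadata. The field names below follow Zenodo’s deposit metadata schema; all values are hypothetical placeholders, not a real project.

```json
{
  "title": "my-analysis-code: v1.0.0",
  "upload_type": "software",
  "description": "Scripts supporting the analysis in the accompanying paper.",
  "creators": [
    {"name": "Surname, Given", "affiliation": "Imperial College London"}
  ],
  "license": "MIT",
  "keywords": ["research software", "research data management"]
}
```

With this file in place, metadata does not need to be re-typed on the Zenodo side for every release.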
22. Live Data Storage: Box (and Others)
• Box for live data storage (non-sensitive) and data
sharing
• Sensitive data storage via ICT secure storage and
encryption
• Specialist data storage, e.g. Omero in the Bioinformatics
Data Science Group for light microscopy images
• Research Computing Repository
• Imperial GitHub for Software and code
24. Backups…
• The ‘3-2-1 principle’: always have
at least 3 copies…
…on at least 2 different media…
…with at least 1 off-site
• ‘LOCKSS’ – Lots Of Copies Keeps Stuff Safe
• Never trust a backup you’ve never tested
• Where possible, let ICT/department/faculty handle
this
25. Archiving and preserving data
Most funders now require that research data be preserved
for at least 10 years.
This is to:
• Enable future work
• Support integrity of published findings
Need to consider:
• What should be kept
• What format to keep it in
• Where to keep it
26. Software should be preserved if:
• Software can’t be separated from the data or digital
object.
• Software is classified as a research output
• Software has intrinsic value
27. Digital Preservation Issues
• Storage, Retrieval, Reconstruction and Replay are all
complexities relating to code libraries, dependencies and
software engineering overall.
• Planning is essential for subsequent retrieval,
reconstruction and replay.
• Software is a digital object which is frequently the result
of research and is often a vital prerequisite for the
preservation of other digital objects.
• Software preservation should be part of a broader
preservation strategy: Research Data Management.
28. Strategies for Digital Preservation
• Data Integrity and File Fixity checks (management of
checksums) – for source code
• Media and Format Migrations
• Refreshing (reduces bit-rot)
• Replication (create duplicate copies, avoids corruption,
loss, erasure)
• Emulation
• Encapsulation (linking content with all information
required for it to be deciphered and understood)
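The fixity-check strategy above can be sketched in a few lines of Python: build a manifest of SHA-256 checksums for an archive directory, then re-verify it later to detect bit-rot or missing files. This is an illustrative sketch under our own naming (`build_manifest`, `verify_manifest` are not a standard tool), not a production preservation system.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(directory):
    """Record a checksum for every file under `directory`."""
    return {str(p): sha256_of(p)
            for p in Path(directory).rglob("*") if p.is_file()}

def verify_manifest(manifest):
    """Re-hash each file; report any that changed or went missing."""
    problems = {}
    for path, expected in manifest.items():
        p = Path(path)
        if not p.is_file():
            problems[path] = "missing"
        elif sha256_of(p) != expected:
            problems[path] = "checksum mismatch"
    return problems
```

Running `verify_manifest` on a schedule (and on every copy, per the 3-2-1 principle) is one simple way to act on “never trust a backup you’ve never tested”.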
30. Zenodo
https://zenodo.org
Research. Shared. — all research outputs
from across all fields of research are
welcome!
Citeable. Discoverable. — uploads get a
Digital Object Identifier (DOI) to make them
easily and uniquely citeable.
Communities — create and curate your own
community. Your own complete digital
repository!
Funding — identify grants, integrated in
reporting lines for research
Flexible licensing — because not everything
is under Creative Commons.
Safe — your research output is stored safely
for the future in the same cloud infrastructure
as CERN's own LHC research data.
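One practical consequence of the DOI: an archived software version can be cited like any other research output. A sketch of what such a citation might look like in BibTeX – the author, title, year, and DOI suffix here are all hypothetical placeholders:

```bibtex
@misc{surname_2017_example,
  author = {Surname, Given},
  title  = {my-analysis-code: v1.0.0},
  year   = {2017},
  doi    = {10.5281/zenodo.XXXXXXX},
  note   = {Software archived on Zenodo}
}
```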
31. Archiving Data ‘without a Repository’?
• Data is archived in Zenodo or in the UK Data Service
(sensitive data) post-project
• Software and code are archived in Zenodo via GitHub
• Metadata from data and software are deposited into
Spiral via Symplectic
• Indexed by DataCite and CrossRef
32. The importance of Metadata
• Ensure correct metadata is used in order to facilitate
discovery – good metadata should be findable through
both machine and human searches.
• Ensure metadata is added following accepted standards
(e.g. the DCC Metadata Standards guide:
http://www.dcc.ac.uk/resources/metadata-standards/list)
33. Publishing: The Data (and Software) Access
Statement
“Published results should always include information
on how to access the supporting data.”
— RCUK Common Principles on Data Policy
Include a statement in all publications stating:
- How/where the underlying data can be obtained
- What restrictions/terms apply
34. Why share data and software?
• Build research profile
• Demonstrate validity of results
• Contribute to the community
• Because you must (sometimes)
35. Share information about your data and software
• You can now share information about your data and software in the College
publications repository ‘Spiral’ via a form on Symplectic.
36. Metadata Strikes Again!
• Ensure that you have good quality metadata present in
order to make your software and data findable,
accessible, (and also interoperable and reusable)
37. Can’t share your data and software?
• Because it’s sensitive/confidential:
- Share an anonymous version
- Share summary statistics
- Deposit in the UK Data Service Secure Lab
- Require users to sign a Data Sharing Agreement
• Because it’s not relevant to anyone else:
- Actually, you’d be surprised…
• Because it’s too much work to prepare:
- Document and organise it as you go along
- The up-front effort will make your future work easier too
38. Guidance for licensing
‘How to License Research Data’
A guide from the Digital Curation Centre
http://www.dcc.ac.uk/resources/how-guides/license-research-data
39. Licensing for Software
- Various open-source software licenses, e.g. MIT, GNU GPL,
Apache, Mozilla Public License, etc.
- https://www.software.ac.uk/tags/licensing
- https://opensource.org/licenses
40. ORCID – Open Researcher and Contributor ID
• Emerging global standard for identifying authors of academic outputs
• The College created ORCID iDs for academic staff in late 2014
(now 2,088 of 3,200 iDs claimed, ~1,500 linked in Elements)
• Imperial hosted the launch of the Jisc ORCID consortium with
50 UK universities in September 2015
http://www.imperial.ac.uk/orcid
41. In Summary…
• Planning: Use DMPOnline to draft a software
management plan.
• Storage: Use GitHub and/or Box to store active software
• Archiving: Use Zenodo, GitHub or another subject-
specific repository to preserve your software
• Discovery: Make your software discoverable in Spiral via
Symplectic
• Publishing: Include a Data Access Statement in your
published article stating where your software can be
found and how it can be accessed and used
42. Any Questions?
Thank you!
For more information and support:
Webpage: www.imperial.ac.uk/research-data-management
E-mail: rdm-enquiries@imperial.ac.uk
And also:
DMPOnline: https://dmponline.dcc.ac.uk/
Software Sustainability Institute:
https://www.software.ac.uk/
Editor's notes
- we’ll be talking more about funder requirements later in the session
This is current for now, but policies do change, so keep up to date with what your funder, institution or publisher require.
Most of us find that we have many calls on our time, and that packing everything that needs to be done into the week is often a challenge. That being the case, it’s easy to feel as though research data management is simply one more thing to add to an already endless to-do list – or worse, that it’s a distraction from real work. However, there are a number of key reasons that it’s worth paying some attention to it.
Good data management does require an investment of effort – but ultimately it’s something that can actually save you time, by helping you work more efficiently. You want to complete your research project to the best of your ability, but with minimum stress – and good research data management is one of the tools that can help you to do that.
Think about:
the frustration of trying to track down a fact or a document we know we have somewhere. Good research data management – setting up an organizational system that works for you, and ensuring everything is properly filed or labelled to enable re-identification and retrieval – can make life a lot easier.
And it’s not just a matter of saving time and reducing unnecessary effort (though clearly that’s a major benefit): having everything well ordered can also help you get a better feel of the shape and scope of your research material, which in turn can enable you to spot patterns or connections that might otherwise get missed.
It’s also well worth doing, because the data you’re producing or working with is valuable
As well as this being true for your own research, the data might ultimately be of use to other researchers. Having everything well organized and properly labelled also has the potential to save you a lot of time at the end of a research project, when it comes to deciding what to do with your data – but more of that later.
Finally, there may be requirements imposed by your funding body and/or the university which you need to meet
It is worth thinking of a DMP as a live document rather than a static one that you complete and never think about again. Details may change as you work through your research and it's important to keep your DMP updated.
A good time to point out that you’re not just dealing with computer files (lab notebooks, photographs, recorded interviews etc.) so by, for example, scanning a notebook, you’re creating a back-up. If you have a transcript of the interview, that’s also another back-up. Just remember – you need three copies!
We have an ICT team here who are the experts in backing up – if you feel you need extra help or reassurance about the back-ups of your files/data, please speak to them.
Thinking about the 1st pillar of information security - Availability - it may be a requirement from your funder that you make the data, or at the very least, information about the data, underpinning your research publicly available. To ensure that this happens you need to think about preservation.
There are a number of places you can store your completed research data – that is, the data that directly underpins your published outputs – in this case, your thesis.
Safe place – not the garage!
A common question among researchers is ‘why would I share my data?!’.
Let’s start with sometimes you must. In the video the panda asked to have access to the bear’s research because it was NIH funded and appeared in Science, which meant that he was under obligation to make his data accessible. But then there are reasons for it being just good science to share the data.
Sharing data can build your reputation in number of ways. Laying your work open to scrutiny means that you will get credit for high quality research, increased understanding of your methods and allowing your work to be verified by others. Sharing allows you to make a greater contribution to your community – and to be recognized for doing so. It can also help extend your reputation beyond that community.
There is also substantial evidence that making your data openly available leads to increased citations – of the datasets themselves, and of the papers or other publications based on the data.
A major change is happening within academia at the moment. Data outputs are being viewed as increasingly important, and this trend is only likely to continue - for example, major journals are increasingly looking to publish (or provide access to) datasets alongside the articles reporting on and interpreting the data.
The College will want to deposit your finished thesis in our publications repository and knowing ahead of time how to make your underpinning data available can help to add value to your work. We are already beginning to hear about collaborations that have been borne out of researchers accessing underpinning data.
There may be occasions when you think that you can't share your data.
There will also be some data licencing guidance on the College RDM webpages