A presentation on research data management tools, workflows and best practices at Imperial College London with a focus on software management. Presented at the 2017 session of the HPC Summer School (Dept. of Computing).
1. Library Services
Research Data (and Software) Management at Imperial*
(*Everything you need to know to gain impact for your work!)
HPC Summer School
Research Data Management
Community Session
Sept. 20th, 2017
Sarah Stewart, Research Data Management Team,
Central Library
2. Outline
1.) What is RDM and how does it apply to software?
2.) The RDM workflow: Tools and processes at Imperial
3.) Plan: Software Management
4.) Store and Archive: the importance of metadata
5.) Publishing and Discovery: Metadata strikes again!
6.) Conclusion/Questions
3. What is Research Data Management?
Research Data Management is part of the research
process, and aims to make the research process as
efficient as possible, and meet expectations and
requirements of the university, research funders,
and legislation.
4. Funder requirements…
“Publicly funded research data are a public good,
produced in the public interest, which should be made
openly available with as few restrictions as
possible…”
RCUK Common Principles on Data Policy
5. The Strong Case for RDM
• Intensive Data-Generating Research Hubs = ‘Big Data’
• UK Med Bio - Bioinformatics Data Science Group – research into causes
and progression of human diseases.
• NHS Trust Research Data (Medicine)
• Research Computing Group and Research Software Engineering
Community
• But also many important ‘small data’ projects across College.
6. Why spend time on RDM?
• It is not a distraction from ‘real work’.
• You can work effectively and efficiently.
• Save time and reduce frustration in the future.
• Set up systems that work for you.
7. Missing Data (and Software)
“In their parents' attic, in boxes in the garage, or stored on now-defunct
floppy disks — these are just some of the inaccessible places in which
scientists have admitted to keeping their old research data.”
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
9. Software: What do I do with it?
• Lots of emphasis on ‘data’ management, but software in
research is often neglected.
• Software is sensitive to changes in its ‘environment’
• There is a lot of variation inherent in software
(languages, versions, licensing, etc.)
11. Software as ‘Data’
• ‘Software is used to create, interpret, present,
manipulate and manage data’ (Software Sustainability
Institute)
• Data: ‘recorded factual material commonly retained by
and accepted…as necessary to validate research
findings’ (EPSRC)
• Software = Data!
12. Treat software as valuable research output
• PyRDM Green Shoots project
• Zenodo integrates with GitHub
• College survey on distributed version control
• Software Sustainability Institute Fellowship
17. PLAN - Data Management Plans
A Data Management Plan is a document that is created in the early
stages of a project that:
• Helps you consider all aspects up front
• Should be useful for you
• Should be kept up to date
An initial plan may be expanded later but should provide details about:
• Plans and expectations for data
• The nature of data and its creation or acquisition
• Storage and security
• Preservation and sharing
19. Software Management Plans
What?
• Like Data Management Plans, Software management
plans provide an outline of uses, responsibilities,
ownership, access and sharing, storage, maintenance
and archiving of research software.
20. Software Management Plans
Why?
• No clear funder requirements yet, but…
• Promotes citability and credit for your research =
Increased Research Impact
• Research Output can be validated/checked by others
• Supports transparency of research and promotes Open
Research.
• Good practice!
21. Software Management Plans
How? (at Imperial College London):
• Specialised template in DMPOnline (via DCC)
• Imperial-specific DMPOnline template (in development).
• Use GitHub (Imperial has an enterprise account)
• Use Zenodo or another subject-specific repository to archive
versions of research software (GitHub integration)
• Log metadata about your software into Symplectic.
• Contact the RDM Team (Central Library) for assistance/support:
rdm-enquiries@imperial.ac.uk
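The GitHub-to-Zenodo route above can be sketched concretely: Zenodo’s GitHub integration archives each tagged release and mints a DOI for it, and a `.zenodo.json` file in the repository root can supply the deposit metadata. The field names below follow Zenodo’s deposit metadata schema; all values are hypothetical placeholders, not a real project.

```json
{
  "title": "my-analysis-code: v1.0.0",
  "upload_type": "software",
  "description": "Scripts supporting the analysis in the accompanying paper.",
  "creators": [
    {"name": "Surname, Given", "affiliation": "Imperial College London"}
  ],
  "license": "MIT",
  "keywords": ["research software", "research data management"]
}
```

With this file in place, metadata does not need to be re-typed on the Zenodo side for every release.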
22. Live Data Storage: Box (and Others)
• Box for live data storage (non-sensitive) and data
sharing
• Sensitive data storage via ICT secure storage and
encryption
• Specialist data storage, e.g. Omero in the Bioinformatics
Data Science Group for light microscopy images
• Research Computing Repository
• Imperial GitHub for Software and code
24. Backups…
• The ‘3-2-1 principle’: always have
at least 3 copies…
…on at least 2 different media…
…with at least 1 off-site
• ‘LOCKSS’ – Lots Of Copies Keeps Stuff Safe
• Never trust a backup you’ve never tested
• Where possible, let ICT/department/faculty handle
this
25. Archiving and preserving data
Most funders now require that research data be preserved
for at least 10 years.
This is to:
• Enable future work
• Support integrity of published findings
Need to consider:
• What should be kept
• What format to keep it in
• Where to keep it
26. Software should be preserved if:
• Software can’t be separated from the data or digital
object.
• Software is classified as a research output
• Software has intrinsic value
27. Digital Preservation Issues
• Storage, Retrieval, Reconstruction and Replay are all
complexities relating to code libraries, dependencies and
software engineering overall.
• Planning is essential for subsequent retrieval,
reconstruction and replay.
• Software is a digital object which is frequently the result
of research and is often a vital prerequisite for the
preservation of other digital objects.
• Software preservation should be part of a broader
preservation strategy: Research Data Management.
28. Strategies for Digital Preservation
• Data Integrity and File Fixity checks (management of
checksums) – for source code
• Media and Format Migrations
• Refreshing (reduces bit-rot)
• Replication (create duplicate copies, avoids corruption,
loss, erasure)
• Emulation
• Encapsulation (linking content with all information
required for it to be deciphered and understood)
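The fixity-check strategy above can be sketched in a few lines of Python: build a manifest of SHA-256 checksums for an archive directory, then re-verify it later to detect bit-rot or missing files. This is an illustrative sketch under our own naming (`build_manifest`, `verify_manifest` are not a standard tool), not a production preservation system.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(directory):
    """Record a checksum for every file under `directory`."""
    return {str(p): sha256_of(p)
            for p in Path(directory).rglob("*") if p.is_file()}

def verify_manifest(manifest):
    """Re-hash each file; report any that changed or went missing."""
    problems = {}
    for path, expected in manifest.items():
        p = Path(path)
        if not p.is_file():
            problems[path] = "missing"
        elif sha256_of(p) != expected:
            problems[path] = "checksum mismatch"
    return problems
```

Running `verify_manifest` on a schedule (and on every copy, per the 3-2-1 principle) is one simple way to act on “never trust a backup you’ve never tested”.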
30. Zenodo
https://zenodo.org
Research. Shared. — all research outputs
from across all fields of research are
welcome!
Citeable. Discoverable. — uploads get a
Digital Object Identifier (DOI) to make them
easily and uniquely citeable.
Communities — create and curate your own
community. Your own complete digital
repository!
Funding — identify grants, integrated in
reporting lines for research
Flexible licensing — because not everything
is under Creative Commons.
Safe — your research output is stored safely
for the future in the same cloud infrastructure
as CERN's own LHC research data.
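One practical consequence of the DOI: an archived software version can be cited like any other research output. A sketch of what such a citation might look like in BibTeX – the author, title, year, and DOI suffix here are all hypothetical placeholders:

```bibtex
@misc{surname_2017_example,
  author = {Surname, Given},
  title  = {my-analysis-code: v1.0.0},
  year   = {2017},
  doi    = {10.5281/zenodo.XXXXXXX},
  note   = {Software archived on Zenodo}
}
```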
31. Archiving Data ‘without a Repository’?
• Data is archived in Zenodo or in the UK Data Service
(sensitive data) post-project
• Software and code are archived in Zenodo via GitHub
• Metadata from data and software are deposited into
Spiral via Symplectic
• Indexed by DataCite and CrossRef
32. The importance of Metadata
• Ensure correct metadata is used in order to facilitate
discovery – good metadata should be findable through
both machine and human searches.
• Ensure metadata is added following accepted standards
(e.g. the DCC Metadata Standards guide:
http://www.dcc.ac.uk/resources/metadata-standards/list)
33. Publishing: The Data (and Software) Access
Statement
“Published results should always include information
on how to access the supporting data.”
— RCUK Common Principles on Data Policy
Include a statement in all publications stating:
- How/where the underlying data can be obtained
- What restrictions/terms apply
34. Why share data and software?
• Build research profile
• Demonstrate validity of results
• Contribute to the community
• Because you must (sometimes)
35. Share information about your data and software
• You can now share information about your data and software in the College
publications repository ‘Spiral’ via a form on Symplectic.
36. Metadata Strikes Again!
• Ensure that you have good quality metadata present in
order to make your software and data findable,
accessible, (and also interoperable and reusable)
37. Can’t share your data and software?
• Because it’s sensitive/confidential:
- Share an anonymous version
- Share summary statistics
- Deposit in the UK Data Service Secure Lab
- Require users to sign a Data Sharing Agreement
• Because it’s not relevant to anyone else:
- Actually, you’d be surprised…
• Because it’s too much work to prepare:
- Document and organise it as you go along
- The up-front effort will make your future work easier too
38. Guidance for licensing
‘How to License Research Data’
A guide from the Digital Curation Centre
http://www.dcc.ac.uk/resources/how-guides/license-research-data
39. Licensing for Software
- Various open-source software licenses, e.g. MIT, GNU GPL,
Apache, Mozilla Public License, etc.
- https://www.software.ac.uk/tags/licensing
- https://opensource.org/licenses
40. ORCID – Open Researcher and Contributor ID
• Emerging global standard for identifying authors of academic outputs
• The College created ORCID iDs for academic staff in late 2014
(now 2,088 of 3,200 iDs claimed, ~1,500 linked in Elements)
• Imperial hosted the launch of the Jisc ORCID consortium with
50 UK universities in September 2015
http://www.imperial.ac.uk/orcid
41. In Summary…
• Planning: Use DMPOnline to draft a software
management plan.
• Storage: Use GitHub and/or Box to store active software
• Archiving: Use Zenodo, GitHub or another subject-
specific repository to preserve your software
• Discovery: Make your software discoverable in Spiral via
Symplectic
• Publishing: Include a Data Access Statement in your
published article stating where your software can be
found and how it can be accessed and used
42. Any Questions?
Thank you!
For more information and support:
Webpage: www.imperial.ac.uk/research-data-management
E-mail: rdm-enquiries@imperial.ac.uk
And also:
DMPOnline: https://dmponline.dcc.ac.uk/
Software Sustainability Institute:
https://www.software.ac.uk/
Editor's notes
- we’ll be talking more about funder requirements later in the session
This is current for now, but policies do change, so keep up to date with what your funder, institution or publisher require.
Most of us find that we have many calls on our time, and that packing everything that needs to be done into the week is often a challenge. That being the case, it’s easy to feel as though research data management is simply one more thing to add to an already endless to-do list – or worse, that it’s a distraction from real work. However, there are a number of key reasons that it’s worth paying some attention to it.
Good data management does require an investment of effort – but ultimately it’s something that can actually save you time, by helping you work more efficiently. You want to complete your research project to the best of your ability, but with minimum stress – and good research data management is one of the tools that can help you to do that.
Think about:
the frustration of trying to track down a fact or a document we know we have somewhere. Good research data management – setting up an organizational system that works for you, and ensuring everything is properly filed or labelled to enable re-identification and retrieval – can make life a lot easier.
And it’s not just a matter of saving time and reducing unnecessary effort (though clearly that’s a major benefit): having everything well ordered can also help you get a better feel of the shape and scope of your research material, which in turn can enable you to spot patterns or connections that might otherwise get missed.
It’s also well worth doing, because the data you’re producing or working with is valuable
As well as this being true for your own research, the data might ultimately be of use to other researchers. Having everything well organized and properly labelled also has the potential to save you a lot of time at the end of a research project, when it comes to deciding what to do with your data – but more of that later.
Finally, there may be requirements imposed by your funding body and/or the university which you need to meet
It is worth thinking of a DMP as a live document rather than a static one that you complete and never think about again. Details may change as you work through your research and it's important to keep your DMP updated.
A good time to point out that you’re not just dealing with computer files (lab notebooks, photographs, recorded interviews etc.) so by, for example, scanning a notebook, you’re creating a back-up. If you have a transcript of the interview, that’s also another back-up. Just remember – you need three copies!
We have an ICT team here who are the experts in backing up – if you feel you need extra help or reassurance about the back-ups of your files/data, please speak to them.
Thinking about the 1st pillar of information security - Availability - it may be a requirement from your funder that you make the data, or at the very least, information about the data, underpinning your research publicly available. To ensure that this happens you need to think about preservation.
There are a number of places you can store your completed research data – that is, the data that directly underpins your published outputs – in this case, your thesis.
Safe place – not the garage!
A common question among researchers is ‘why would I share my data?!’.
Let’s start with sometimes you must. In the video the panda asked to have access to the bear’s research because it was NIH funded and appeared in Science, which meant that he was under obligation to make his data accessible. But then there are reasons for it being just good science to share the data.
Sharing data can build your reputation in number of ways. Laying your work open to scrutiny means that you will get credit for high quality research, increased understanding of your methods and allowing your work to be verified by others. Sharing allows you to make a greater contribution to your community – and to be recognized for doing so. It can also help extend your reputation beyond that community.
There is also substantial evidence that making your data openly available leads to increased citations – of the datasets themselves, and of the papers or other publications based on the data.
A major change is happening within academia at the moment. Data outputs are being viewed as increasingly important, and this trend is only likely to continue - for example, major journals are increasingly looking to publish (or provide access to) datasets alongside the articles reporting on and interpreting the data.
The College will want to deposit your finished thesis in our publications repository and knowing ahead of time how to make your underpinning data available can help to add value to your work. We are already beginning to hear about collaborations that have been borne out of researchers accessing underpinning data.
There may be occasions when you think that you can't share your data.
There will also be some data licencing guidance on the College RDM webpages