www.eudat.eu
The presentation gives an introduction to Research Data Management, explaining why it is important to manage and share data.
November 2016
1. EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 www.eudat.eu
Research Data Management
Version 2
August 2016
This work is licensed under the Creative Commons CC-BY 4.0 licence
2. The changing data landscape
Managing and sharing research data
EUDAT services
Overview
3. THE CHANGING DATA LANDSCAPE
Image CC-BY-SA ‘data.path Ryoji.Ikeda - 3’ by r2hox www.flickr.com/photos/rh2ox/9990016123
4. Data explosion
More and more data is being created
The issue is not creating data, but being able to navigate and use it
Data management is critical to make sure data are well-organised, understandable and reusable
Image by ‘Coupmedia’ by http://www.coupmedia.com/resources/
5. Digital data are fragile and susceptible to loss for a wide variety of reasons
Natural disaster
Facilities infrastructure failure
Storage failure
Server hardware/software failure
Application software failure
Format obsolescence
Legal encumbrance
Human error
Malicious attack
Loss of staffing competencies
Loss of institutional commitment
Loss of financial stability
Changes in user expectations
Data loss
Image CC-BY ‘Hard Drive 016’ by Jon Ross www.flickr.com/photos/jon_a_ross/1482849745
6. Link rot – more 404 errors generated over time
Reference rot* – link rot plus content drift, i.e. webpages evolving and no longer reflecting the original content cited
* Term coined by Hiberlink http://hiberlink.org
Data persistency issues
Jonathan D. Wren Bioinformatics 2008;24:1381-1385
7. A reproducibility crisis
Nature special issue
http://www.nature.com/news/reproducibility-1.17552
Several studies have shown alarming numbers of published papers that don’t stand up to scrutiny
8. A wildlife biologist for a small field office was the in-house GIS expert
and provided support for all the staff’s GIS needs. However, the data
was stored on her own workstation. When
the biologist relocated to another office, no one understood how
the data was stored or managed.
Solution: A state office GIS specialist retrieved the workstation
and sifted through files trying to salvage relevant data.
Cost: 1 work month ($4,000) plus the value of data that was not
recovered
Consider that the situation could have been worse, because the data
was not being backed up as it would have been if stored on a server.
Poor data management - science example
9. In preparation for a Resource Management Plan, an office
discovered 14 duplicate GPS inventories of roads.
However, because none of the inventories had enough
metadata, it was impossible to know which inventory was
best or if any of the inventories actually met their
requirements.
Solution: Re-Inventory roads
Cost: Estimated 9 work months
per inventory @$4,000/wm
(14 inventories = $504,000)
Poor data management - federal example
Image CC-BY ‘Minature fake highway interchange in Chicago’ by Ryan www.flickr.com/photos/ryanready/4692092024
10. Why manage research data?
To make your research easier!
To stop yourself drowning in irrelevant stuff
In case you need the data later
To avoid accusations of fraud or bad science
To share your data for others to use and learn from
To get credit for producing it
Because funders or your organisation require it
Well-managed data opens up opportunities for re-use, integration and new science
11. MANAGING & SHARING DATA
Image CC-BY-SA by https://www.flickr.com/photos/notbrucelee/8016192302
12. CREATING DATA
PROCESSING DATA
ANALYSING DATA
PRESERVING DATA
GIVING ACCESS TO DATA
RE-USING DATA
Research data lifecycle
CREATING DATA: designing research, DMPs, planning consent, locate existing data, data collection and management, capturing and creating metadata
PROCESSING DATA: entering, transcribing, checking, validating and cleaning data, anonymising data, describing data, manage and store data
ANALYSING DATA: interpreting & deriving data, producing outputs, authoring publications, preparing for sharing
PRESERVING DATA: data storage, back-up & archiving, migrating to best format & medium, creating metadata and documentation
ACCESS TO DATA: distributing data, sharing data, controlling access, establishing copyright, promoting data
RE-USING DATA: follow-up research, new research, undertake research reviews, scrutinising findings, teaching & learning
Ref: UK Data Archive: http://www.data-archive.ac.uk/create-manage/life-cycle
14. A DMP is a brief plan to define:
• how the data will be created
• how it will be documented
• who will access it
• where it will be stored
• who will back it up
• whether (and how) it will be shared & preserved
DMPs are often submitted as part of grant applications, but
are useful whenever researchers are creating data.
Data Management Planning
15. Metadata and documentation are needed to locate and understand research data
Think about what others would need in order to find,
evaluate, understand, and reuse your data.
Get others to check the metadata to improve quality
Use standards to enable interoperability
Metadata and documentation
16. Where to store your data?
Your own drive (PC, server, flash drive, etc.)
– And if you lose it? Or it breaks?
Somebody else’s drive / departmental drive
“Cloud” drive
– Do they care as much about your data as you do?
Large scale infrastructure services like EUDAT
17. How to back up?
3... 2... 1... backup!
– at least 3 copies of a file
– on at least 2 different media
– with at least 1 offsite
Use managed services where possible, e.g. University filestores or infrastructure services like EUDAT, rather than local or external hard drives
Ask IT teams for advice
18. Backup and preservation – not the same thing!
Backups
o Used to take periodic snapshots of data in case the current version is destroyed or lost
o Backups are copies of files stored for short or near-long-term
o Often performed on a somewhat frequent schedule
Archiving
o Used to preserve data for historical reference or potentially during disasters
o Archives are usually the final version, stored for long-term, and generally not copied over
o Often performed at the end of a project or during major milestones
19. A mistake in a spreadsheet led to dramatically different results from those published.
These results were cited by the International Monetary Fund and the UK Treasury to justify austerity programmes.
Had the data been shared, this could have been picked up earlier.
The importance of sharing data
21. Concerns About Data Sharing
Concern: inappropriate use due to misunderstanding of research purpose or parameters
Solution: metadata
Concern: security and confidentiality of sensitive data
Solution: metadata
Concern: lack of acknowledgement / credit
Solution: metadata
Concern: loss of advantage when competing for research dollars
Solution: metadata
22. Concerns About Data Sharing
Concern: inappropriate use due to misunderstanding of research purpose or parameters
Solution: provide rich Abstract, Purpose, Use Constraints and Supplemental Information where needed
Concern: security and confidentiality of sensitive data
Solution:
• the metadata does NOT contain the data
• Use Constraints specify who may access the data and how
Concern: lack of acknowledgement / credit
Solution: specify a required data citation within the Use Constraints
Concern: loss of data insight and competitive advantage when vying for research dollars
Solution: create second, public version with generalized Data Processing Description
23. Making data shareable
Create robust metadata that has been checked
Include reference information e.g. unique IDs & properly formatted data citations
Publish your metadata so it’s discoverable. Use portals, clearing houses, online resources…
Package up the data and associated metadata to deposit in repositories
24. Deciding what to preserve and share
It’s not possible to keep everything. Select based on:
What has to be kept e.g. data underlying publications
What can’t be recreated e.g. environmental recordings
What is potentially useful to others
What has scientific, cultural or historical value
What legally must be destroyed
How to select and appraise research data:
www.dcc.ac.uk/resources/how-guides/appraise-select-research-data
25. EUDAT SERVICE SUITE
Image CC-BY-NC ‘Data centre’ by Bob Mical www.flickr.com/photos/small_realm/15995555571
26. EUDAT services
EUDAT offers a pan-European solution, providing a generic set of services to ensure a minimum level of interoperability
Building common data services in close collaboration with 25+ communities
27. EUDAT B2 service suite
Covering both access and deposit, from informal data sharing to long-term archiving, and addressing identification, discoverability and computability of both long-tail and big data, EUDAT’s services will address the full lifecycle of research data
28. Support throughout the lifecycle
CREATING DATA
PROCESSING DATA
ANALYSING DATA
PRESERVING DATA
GIVING ACCESS TO DATA
RE-USING DATA
29. www.eudat.eu
Authors Contributors
Sarah Jones, Digital Curation Centre
Mark van de Sanden, SURFsara
Thank you
Content has also been repurposed from the DataONE Educational
modules, ‘Data Management’ and ‘Data Sharing’ Retrieved from
https://www.dataone.org/education-modules
Editor's notes
This presentation will give an introduction to Research Data Management, explaining why it is important to manage and share data.
There are three main topics that we will discuss:
The changing data landscape, looking at what issues this brings.
Secondly, we discuss considerations to make when managing and sharing data.
Finally, we’ll touch on the EUDAT service suite and how support is provided throughout the lifecycle.
So let’s begin by looking at the changing data landscape.
There’s been a data explosion. The amount of data being created now is growing exponentially, so the biggest challenge is being able to navigate and use it. This is why data management is critical.
Digital data are fragile. There are lots of ways in which data can be lost. Hardware and software can fail, formats can become obsolete, you can lose the knowledge and skills needed to understand the data, and you can lose the investment needed to keep the data accessible.
Several studies have also shown issues with data persistence. This graph shows how many broken links there are in a selection of MEDLINE papers. The further back you go, the higher the percentage, and worryingly, the highest percentage is for the most recent papers (abstracts from 2007).
Another issue that occurs is reference rot, where the link still resolves, but the content presented no longer reflects the original content cited as the webpage has been updated.
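The distinction between link rot and content drift can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a fingerprint of the page is assumed to have been recorded at citation time, and a byte-exact hash is a deliberately crude test that would also flag trivial changes such as a new page footer.

```python
import hashlib
from typing import Optional


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 fingerprint of the content as it was when cited."""
    return hashlib.sha256(content).hexdigest()


def classify_reference(status_code: int, cited_fingerprint: str,
                       current_content: Optional[bytes]) -> str:
    """Classify a cited link as 'link rot', 'content drift' or 'intact'.

    status_code and current_content are assumed to come from a fresh
    request to the cited URL (no network access is done here).
    """
    if status_code >= 400 or current_content is None:
        return "link rot"        # the link no longer resolves
    if fingerprint(current_content) != cited_fingerprint:
        return "content drift"   # it resolves, but the page has changed
    return "intact"


cited = fingerprint(b"Dataset description, version 1")
print(classify_reference(404, cited, None))
print(classify_reference(200, cited, b"Dataset description, version 2"))
print(classify_reference(200, cited, b"Dataset description, version 1"))
```

Reference rot, as defined on the slide, covers the first two outcomes together.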
All of these issues are leading to a reproducibility crisis. Several studies have shown alarming numbers of published papers that don’t stand up to scrutiny. In 2015, Nature released a special issue on this.
There are lots of ways in which data can be poorly managed so let’s look at a couple of examples.
The first one is about a loss of expertise. A wildlife biologist was the in-house GIS expert, but when she relocated to another office, no one understood how the data was stored or managed. They had to bring another specialist in at a cost of 1 month’s work. It could have been worse though as the data were stored on a standalone computer and weren’t being backed up.
You need to manage transitions when staff move on to make sure everything is properly documented so the data are accessible to and understood by others.
The other example comes from government. An office found several duplicate GPS inventories of roads, none of which was properly described. As it wasn’t clear which was the most up-to-date and accurate, they had to re-inventory the roads. If data aren’t properly documented, they may become unusable, forcing you to re-create the data. Here the cost was 9 months of work per inventory, so over $500,000.
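The cost figures in the two examples follow from simple arithmetic, which can be checked with a short sketch. The rate of $4,000 per work month and the counts come from the slides; the function name is just for illustration.

```python
COST_PER_WORK_MONTH = 4_000  # USD, the rate quoted on the slides


def rework_cost(work_months: int, units: int = 1) -> int:
    """Cost in USD of redoing `units` pieces of work taking `work_months` each."""
    return work_months * units * COST_PER_WORK_MONTH


# Science example: one work month to salvage the GIS data.
gis_cost = rework_cost(work_months=1)

# Federal example: 14 inventories at an estimated 9 work months each.
roads_cost = rework_cost(work_months=9, units=14)

print(gis_cost)    # 4000
print(roads_cost)  # 504000
```

Neither figure includes the value of data that could not be recovered.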
There are lots of reasons to manage research data. Ultimately though, it’s to make your research easier. If data are properly documented and organised, you can stop yourself drowning in irrelevant stuff and find the data when you need it – for example to validate findings. By managing your data you can also more easily share it with others to get more credit and impact. You may also be required to explain how you will manage your data by your funder or university.
Well-managed data opens up opportunities for re-use, integration and new science.
Let’s move on to the considerations to make when managing and sharing data.
This research data lifecycle is taken from the UK Data Archive. It shows you the different processes and activities you’ll go through.
Creating data: This is when you’ll design the research, write Data Management Plans, negotiate consent agreements, find any existing data you want to reuse, collect/capture your data and create any associated metadata
Processing data: When processing your data, you’ll be entering, transcribing, checking, validating and cleaning it. You may also need to anonymise your data. You should describe it and make sure it’s properly managed and stored.
Analysing data: When you analyse your data, you’ll be interpreting it and creating derived data and outputs. You’ll probably also author publications and prepare the data for deposit and sharing.
Preserving data: Data repositories play a key role in preserving data. They will make sure it’s properly stored and archived, migrate the formats and storage medium, and create associated metadata and documentation to explain any changes made.
Access to data: it may be that you share your data via a repository or handle access requests yourself. Either way, you need to establish copyright, decide who can have access and promote the data.
Re-using data: data can be re-used in follow-up studies, new research, research reviews, to evidence findings or for teaching and learning. Try to keep an open mind about the different ways in which your data could be re-used and make it as open as possible.
A digital object is a bitstream, with a persistent identifier and associated metadata. The data alone (literally just the bitstream) is meaningless if others can’t find and understand it.
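As a rough illustration of this definition, a digital object can be modelled as three parts that travel together. The class below is a sketch, not an EUDAT data model; the field names, the required metadata keys and the example identifier are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    pid: str                                   # persistent identifier
    bitstream: bytes                           # the data itself
    metadata: dict = field(default_factory=dict)

    def checksum(self) -> str:
        """Fixity value that lets others verify the bitstream is unchanged."""
        return hashlib.sha256(self.bitstream).hexdigest()

    def is_understandable(self, required=("title", "creator", "description")) -> bool:
        """True only if every (hypothetical) required metadata field is filled in."""
        return all(self.metadata.get(key) for key in required)


bare = DigitalObject(pid="hdl:0000/example-1", bitstream=b"\x00\x01\x02")
described = DigitalObject(
    pid="hdl:0000/example-2",
    bitstream=b"\x00\x01\x02",
    metadata={"title": "Road inventory", "creator": "Field office",
              "description": "GPS survey of roads"},
)
print(bare.is_understandable())       # False
print(described.is_understandable())  # True
```

The two objects hold identical bytes; only the second one can be found and interpreted by someone else.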
A Data Management Plan is often written early on in the research process to determine what data will be created and how it will be managed. Sometimes you are asked for a DMP as part of a grant application, but a plan is useful to write regardless, as it helps to develop consistent procedures from the outset.
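The six questions on the Data Management Planning slide can be treated as a checklist. The sketch below assumes a plan is kept as a simple dictionary of answers; the keys and the helper name are hypothetical and do not correspond to any funder's template.

```python
# The six questions from the slide, keyed by a short hypothetical label.
DMP_QUESTIONS = {
    "creation": "How will the data be created?",
    "documentation": "How will it be documented?",
    "access": "Who will access it?",
    "storage": "Where will it be stored?",
    "backup": "Who will back it up?",
    "sharing_preservation": "Will it be shared and preserved, and how?",
}


def unanswered(plan: dict) -> list:
    """Return the labels of questions the plan leaves blank or omits."""
    return [key for key in DMP_QUESTIONS if not plan.get(key, "").strip()]


draft = {
    "creation": "Field GPS surveys",
    "storage": "Departmental server",
    "backup": "   ",  # whitespace only counts as unanswered
}
for key in unanswered(draft):
    print("Still to answer:", DMP_QUESTIONS[key])
```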
Metadata is needed to locate and understand the data. When you are deciding what information to capture, think about what others would need in order to find, evaluate, understand, and reuse your data. Also get others to check your metadata to improve the quality and make sure it’s understandable to others. Standards should be used where possible.
There are lots of places you can store your data. It is best to use managed services where possible as they’re more resilient. If you store data on standalone computers, memory sticks or in the cloud, be mindful of the risk of loss or security breaches.
If you’re responsible for backing up your own data, you want to ensure there are multiple copies, on different media with at least 1 offsite. Where possible though, you should use managed services so the backup is done automatically for you.
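The 3-2-1 rule from the backup slide (at least 3 copies, on at least 2 different media, with at least 1 offsite) is easy to express as a check. This is a minimal sketch; the `Copy` record and its fields are assumptions for illustration, not part of any backup tool.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    location: str   # free-text description of where the copy lives
    medium: str     # e.g. "disk", "tape", "cloud"
    offsite: bool   # True if stored away from the primary site


def satisfies_3_2_1(copies: list) -> bool:
    """Check the rule: >= 3 copies, >= 2 distinct media, >= 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


laptop_only = [Copy("laptop", "disk", False), Copy("external drive", "disk", False)]
managed = laptop_only + [Copy("university filestore", "tape", True)]

print(satisfies_3_2_1(laptop_only))  # False
print(satisfies_3_2_1(managed))      # True
```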
Remember that backup and preservation are not the same thing (though the terms are often used interchangeably).
Backups are performed regularly to take periodic snapshots of the data for the short to medium term, whereas archiving is preserving the final version of the data for the long-term.
You should make sure your data are backed-up during the active phase of research and that any data needed for the long-term are archived.
It is also important to share your data where possible, particularly to evidence your findings.
This article reflects on an inadvertent error in an economics paper by Reinhart and Rogoff. Omitting some rows from an average gave drastically different results: the published paper suggested that countries with 90% debt ratios see their economies shrink by 0.1%. The corrected calculation found that they grow by 2.2%, which is less than those with lower debt ratios but not a spiralling collapse. The mistake wasn’t picked up initially because the data hadn’t been shared. It fed into government policy, as the findings were used to justify austerity measures in the UK and various other countries in the EU.
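The kind of error described here, an average taken over too few rows, can be shown with a toy example. The growth figures below are invented purely for illustration and are not the figures from the paper; the point is only that a truncated range can flip the sign of a mean, and that anyone with access to the data can rerun the calculation.

```python
from statistics import mean

# Hypothetical growth rates (%) for five hypothetical countries.
growth = [
    ("Country A", -2.0),
    ("Country B", 1.0),
    ("Country C", 0.5),
    ("Country D", 3.0),
    ("Country E", 4.0),
]

# A spreadsheet range that stops two rows early averages only A to C.
truncated_mean = mean(rate for _, rate in growth[:3])

# Averaging every row gives a very different picture.
full_mean = mean(rate for _, rate in growth)

print(f"Truncated range: {truncated_mean:.2f}%")
print(f"All rows:        {full_mean:.2f}%")
```

With these made-up numbers the truncated average is negative and the full average is positive, mirroring the direction of the correction described in the notes.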
Naturally, researchers may worry that the data will be taken out of context, misinterpreted or used inappropriately. They may also be concerned about maintaining the confidentiality and security of sensitive data. Business concerns may arise as well - will data users give proper credit and acknowledgement to the scientist? Will the scientist lose a competitive advantage by sharing this valuable resource?
There are lots of reasons why researchers may be reluctant to share data, so what is the solution?
Each of these issues can, in great part, be addressed by providing rich data documentation known as ‘metadata’.
By providing metadata, the research scientist establishes the purpose, methods, sources and parameters of the data. As such, data users are given the information necessary to appropriately apply, protect and cite the data. If the metadata contains information about proprietary data processing or analysis techniques, the competitive advantage can be maintained by creating a second, more generalized, metadata record for public distribution.
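Creating a second, generalized record for public distribution might look like the sketch below. The field names follow the element names on the slide (Abstract, Purpose, Use Constraints, Data Processing Description), but the dictionary layout, the function name and the example values are assumptions, not a real metadata schema.

```python
import copy


def public_version(record: dict, generalized_description: str) -> dict:
    """Return a copy of the record with the processing description generalized.

    The original (internal) record is left untouched.
    """
    public = copy.deepcopy(record)
    public["data_processing_description"] = generalized_description
    return public


internal = {
    "abstract": "GPS inventory of roads in the planning area.",
    "purpose": "Input to a Resource Management Plan.",
    "use_constraints": "Cite as specified by the data owner; contact owner for access.",
    "data_processing_description": "Proprietary smoothing with in-house parameters.",
}

public = public_version(internal, "Positions were smoothed and quality checked.")
print(public["data_processing_description"])
print(internal["data_processing_description"])
```

The Use Constraints, including any required citation, carry over unchanged, so the public record still tells users how to access and credit the data.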
To make your data shareable, you should create robust metadata and seek a second opinion on it to ensure it’s understandable to others. Also include reference information so others can find your data and give you credit. The metadata should be published online and packaged up with your data to deposit in repositories.
It might not be possible to preserve and share all your data, so you may need to make a selection. Some factors to consider are what has to be kept (for example for legal reasons or to evidence findings), what is potentially useful to others, and what can’t be recreated. You may also be under obligation to destroy certain data due to consent agreements or commercial non-disclosure restrictions.
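Reference information is easier to keep consistent if the citation is generated from the metadata rather than typed by hand. The sketch below assumes one common pattern (creator, year, title, publisher, identifier); citation styles vary between repositories, and the field names and example values here are hypothetical.

```python
REQUIRED_FIELDS = ("creator", "year", "title", "publisher", "identifier")


def missing_fields(meta: dict) -> list:
    """List the reference fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not meta.get(name)]


def format_citation(meta: dict) -> str:
    """Build a data citation, refusing to guess any missing field."""
    missing = missing_fields(meta)
    if missing:
        raise ValueError("cannot build citation, missing: " + ", ".join(missing))
    return "{creator} ({year}). {title}. {publisher}. {identifier}".format(**meta)


meta = {
    "creator": "Doe, J.",
    "year": 2016,
    "title": "Example road inventory",
    "publisher": "Example Repository",
    "identifier": "hdl:0000/example-1",  # placeholder, not a real identifier
}
print(format_citation(meta))
```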
The Digital Curation Centre has guidance on how to select what data to keep.
Let’s close by looking briefly at the EUDAT service suite and how it helps with data management and sharing.
EUDAT offers a pan-European solution, providing a generic set of data services. These are being built in close collaboration with user communities.
The services help researchers to store, manage and process data throughout the active phase of research, and also help to archive data and make it discoverable to others.
The B2DROP service helps you to synchronise and exchange research data, much like Dropbox; B2STAGE helps you get data to computation when processing and analysing data; B2SAFE helps you to replicate the data safely; B2SHARE is a repository to archive the data and share it with others; and B2FIND is a cataloguing service that allows you and others to find relevant data.