Presentation given by Stéphane Coutin during the PRACE 2017 Spring School, a joint training event with the EU H2020 VI-SEEM project (https://vi-seem.eu/) organised by CaSToRC at The Cyprus Institute. Science, and more specifically projects using HPC, is facing a digital data explosion. Instruments and simulations are producing ever larger volumes; data can be shared, mined, cited, preserved… Data are a great asset, but they face risks: we can run out of storage, we can lose them, they can be misused… To start this session, we will review why it is important to manage research data and how to do so by maintaining a Data Management Plan (DMP). This will be based on best practices from the EUDAT H2020 project and European Commission recommendations. During the second part we will interactively draft a DMP for a given use case.
Data management plans – EUDAT Best practices and case study | www.eudat.eu
1. EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 www.eudat.eu
Data Management Plans –
EUDAT best practices and case study
Stéphane COUTIN (CINES)
26 April 2017
This work is licensed under the Creative
Commons CC-BY 4.0 licence
2. Objectives
High level presentation of research
data management and H2020 context
Present a simple approach and draft a
DMP for a given case.
Overview of EUDAT services
4. Based in Montpellier, France – approx. 60 people
(engineers, techs, admin)
Created in 1999; formerly CNUSC (Centre National
Universitaire Sud de Calcul), created in 1980
Administered and funded by the Ministry of Higher
Education & Research (MESR)
Provides the French public research
community with computing
resources, services and
expertise
3 main mandates / activities:
High performance computing
Digital preservation
Hosting
Centre Informatique National de l’Enseignement
Supérieur
5. Stéphane COUTIN coutin@cines.fr
Research engineer at CINES since 2013
Initially in digital preservation dept
Now working in HPC dept
Involved in EU projects
Leading PRACE collaboration task with other
eInfra
Working on EUDAT for collaboration with PRACE
Background in Information Systems projects and
programmes management
6. EUDAT – www.eudat.eu
Image CC-BY-NC ‘Data centre’ by Bob Mical
www.flickr.com/photos/small_realm/15995555571
7. A pan-European e-Infrastructure solution for pan-
European RI data Challenges
All RIs are facing data challenges
Where to store the growing amount of data?
How to find it?
How to make the most of it?
Solutions are needed at pan-European level
We need to promote synergies
Some services are common to many
communities
Costs and investments can be
optimised
Better integration of e-infras and
research infrastructures can be
achieved
9. A truly pan-European Infrastructure
EUDAT offers common data services, supporting
multiple research communities as well as individuals,
through a geographically distributed, resilient network
of 35 European organisations
The EUDAT vision is to enable
European researchers and
practitioners from any research
discipline to preserve, find,
access, and process data in a
trusted environment, as part of
a Collaborative Data
Infrastructure
10. PRACE – EUDAT collaboration
Joint Open Calls for proposals
EUDAT offering data services and resources
through regular PRACE calls
Review process is transparent to users
Joint training activities
Continuous technical discussion and
developments of new components
Definition of the EUDAT Workspace area
Synchronization of authentication credentials
for single sign-on
11. Quick question:
Think about your ongoing or upcoming HPC project
What are your data-related requirements?
What is the budget for this?
12. THE CHANGING DATA LANDSCAPE
Image CC-BY-SA ‘data.path Ryoji.Ikeda - 3’ by r2hox www.flickr.com/photos/rh2ox/9990016123
13. Data explosion
More and more data is
being created
Issue is not creating
data, but being able to
navigate and use it
Data management is
critical to make sure
data are well-organised,
understandable and
reusable
14. Digital data are fragile and susceptible to loss for a wide variety of reasons
Natural disaster
Facilities infrastructure failure
Storage failure
Server hardware/software failure
Application software failure
Format obsolescence
Human error
Malicious attack
Loss of staffing competencies
Loss of institutional commitment
Loss of financial stability
Changes in user expectations
Data loss
Image CC-BY ‘Hard Drive 016’ by Jon Ross www.flickr.com/photos/jon_a_ross/1482849745
15. Link rot – more 404 errors
generated over time
Reference rot* – link rot
plus content drift i.e.
webpages evolving and
no longer reflecting
original content cited
* Term coined by Hiberlink http://hiberlink.org
Data persistency issues
Jonathan D. Wren Bioinformatics 2008;24:1381-1385
17. Why manage research data?
To make your research easier!
To stop yourself drowning in irrelevant stuff
In case you need the data later
To avoid accusations of fraud or bad science
To share your data for others to use and learn from
To get credit for producing it
Because funders or your organisation require it
Well-managed data opens up opportunities
for re-use, integration and new science
18. H2020 open research data pilot
• Already expanded from a select pilot to all work
areas
• All need to consider which data can be made
open
• Mantra = “As open as possible, as closed as
necessary”
• Underlying driver is good (FAIR) data
management
Image CC-BY-SA by SangyaPundir
19. Key requirements of the open data pilot
Beneficiaries participating in the Pilot will:
Deposit data in a research data repository of
their choice
Take measures to make it possible for others to
access, mine, exploit, reproduce and
disseminate the data free of charge
Provide information about tools and instruments
necessary for validating the results (where
possible, provide the tools and instruments
themselves)
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
22. Simple diagram focusing on data dynamics
Other diagram types can also be used
DFD: Data Flow Diagram
23. You and your team are submitting a proposal for a project in the domain of smart cities.
The City has deployed a large set of sensors measuring traffic. The data are collected
in the City datacenter.
You want to develop an application able to forecast the traffic and how it will
be impacted by events such as planned roadworks. This application would run on a PRACE
site, not located in the City. On the PRACE site your storage space is limited to 10 TB.
The application uses the following inputs:
Historical sensor data over the last 12 months: the sensors produce 1 TB of data a day.
You implement a preprocessing module translating those data into a reduced data set
(10 MB per day), based on a format you have defined to describe the traffic.
The results provided by the simulation. This enables comparison between forecast
and actual traffic in order to ‘train’ the application.
Weather data (historical and forecast) provided by the national meteorological agency, in
the SYNOP format. The volume is negligible.
Results will be accessible to the city council employees.
Create the project data flow diagram and fill in the data summary chapter using a
table.
What would you need in order to use the weather data efficiently?
Exercise – Phase 1
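The volumes given in the exercise can be checked with a little arithmetic before drawing the diagram. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and a 365-day year, neither of which is stated in the exercise; the variable names are purely illustrative.

```python
# Back-of-the-envelope sizing for the smart-city exercise.
# Assumptions (not stated in the exercise): decimal units, 365-day year.
MB_PER_TB = 1_000_000

raw_tb_per_day = 1          # raw sensor output
reduced_mb_per_day = 10     # after preprocessing
prace_quota_tb = 10         # storage limit on the PRACE site
days = 365                  # 12 months of history

raw_total_tb = raw_tb_per_day * days
reduced_total_tb = reduced_mb_per_day * days / MB_PER_TB
days_of_raw_that_fit = prace_quota_tb // raw_tb_per_day

print(f"Raw history:     {raw_total_tb} TB")
print(f"Reduced history: {reduced_total_tb * 1000:.2f} GB")
print(f"Raw data fitting in the quota: {days_of_raw_that_fit} days")
```

Under these assumptions the raw history (365 TB) is far beyond the 10 TB quota, while the reduced data set (about 3.65 GB) fits easily. This suggests preprocessing close to the sensors and transferring only the reduced data to the PRACE site.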
25. Proposed data flow diagram
[Figure: data flow diagram. Labels: Sensors collection area; Raw sensor data; Data preprocessing; Reduced sensor data; Data transfer; PRACE HPC site; PRACE storage; Input files; Simulations; Output files; Extractor; Weather data; City council employees]
26. Data summary table
Dataset | Description | Origin? Existing? | Format | Size | Who could use it?
Raw sensor data | | Available, collected from sensors | Various | 1 TB per day |
Reduced sensor data | Actual traffic, … | Extracted from raw sensor data | Binary (specific) | 10 MB a day | Our simulation
Weather data | Actual and forecast | Existing. Meteo open data platform | SYNOP | 1 MB a week | Our simulation; citizens, scientists, ...
Simulation results | Forecasted traffic | Results of our simulation | Binary (specific) | 10 MB a day | City council employees, our application
27. Research data lifecycle
CREATING DATA: designing research, DMPs, planning consent, locating existing data, data collection and management, capturing and creating metadata
PROCESSING DATA: entering, transcribing, checking, validating and cleaning data, anonymising data, describing data, managing and storing data
ANALYSING DATA: interpreting and deriving data, producing outputs, authoring publications, preparing for sharing
PRESERVING DATA: data storage, back-up and archiving, migrating to best format and medium, creating metadata and documentation
GIVING ACCESS TO DATA: distributing data, sharing data, controlling access, establishing copyright, promoting data
RE-USING DATA: follow-up research, new research, undertaking research reviews, scrutinising findings, teaching and learning
Ref: UK Data Archive: http://www.data-archive.ac.uk/create-manage/life-cycle
28. Findable
– Assign persistent IDs, provide rich metadata, register in a searchable
resource,...
Accessible
– Retrievable by their ID using a standard protocol, metadata remain accessible
even if data aren’t...
Interoperable
– Use formal, broadly applicable languages, use standard vocabularies,
qualified references...
Reusable
– Rich, accurate metadata, clear licences, provenance, use of community
standards...
FAIR for machines as well as people
www.force11.org/group/fairgroup/fairprinciples
Making data FAIR
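As an illustration of how the four principles translate into practice, a metadata record can be checked for obvious gaps before deposit. The sketch below is only a self-check under assumed field names; the mapping of fields to principles is hypothetical and is not an EUDAT or FORCE11 schema.

```python
# Minimal, illustrative FAIR self-check for a metadata record.
# The field names below are hypothetical, not an EUDAT or FORCE11 schema.
FAIR_FIELDS = {
    "Findable": ["identifier", "title", "keywords"],
    "Accessible": ["access_url"],
    "Interoperable": ["format"],
    "Reusable": ["licence", "provenance"],
}

def missing_fair_fields(record):
    """Return {principle: [missing or empty fields]} for a metadata dict."""
    gaps = {}
    for principle, fields in FAIR_FIELDS.items():
        missing = [f for f in fields if not record.get(f)]
        if missing:
            gaps[principle] = missing
    return gaps

record = {
    "identifier": "doi:10.0000/example",   # placeholder, not a real DOI
    "title": "Reduced traffic sensor data",
    "keywords": ["traffic", "smart city"],
    "format": "binary (project-specific)",
    "licence": "CC-BY-4.0",
}
print(missing_fair_fields(record))
```

Here the record would be reported as lacking an access URL and provenance information. A check like this only catches missing fields; it says nothing about whether the metadata are accurate or rich enough.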
31. Metadata and documentation are needed to locate
and understand research data
Think about what others would need in order to find,
evaluate, understand, and reuse your data.
Get others to check the metadata to improve quality
Use standards to enable interoperability
Metadata and documentation
32. Use of standards
Controlled vocabularies for unambiguous keywords
Simple, complete and consistent information
Appropriate description
Explanation of limitations to support reuse
Avoid special characters e.g. !@<~ etc...
Provide persistent identifiers such as DOIs
What makes metadata good?
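Two of the points above, controlled vocabularies and avoiding special characters, lend themselves to automated checks. The following is a minimal sketch: the vocabulary is a hypothetical example (a real project would use a community standard), and the rule for what counts as a special character is an assumption.

```python
import re

# Hypothetical controlled vocabulary; a real project would use a community standard.
VOCABULARY = {"traffic", "weather", "forecast", "sensor"}
# Assumption: anything other than ASCII letters, digits, space, hyphen
# and underscore is treated as a special character.
SPECIAL = re.compile(r"[^A-Za-z0-9 _-]")

def check_keywords(keywords):
    """Split keywords into accepted terms and problems to fix."""
    accepted, problems = [], []
    for kw in keywords:
        if SPECIAL.search(kw):
            problems.append((kw, "special character"))
        elif kw.lower() not in VOCABULARY:
            problems.append((kw, "not in vocabulary"))
        else:
            accepted.append(kw.lower())
    return accepted, problems

accepted, problems = check_keywords(["Traffic", "jam!", "cars"])
print(accepted)
print(problems)
```

In this example "Traffic" is accepted (normalised to lower case), "jam!" is flagged for its special character and "cars" for being outside the vocabulary.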
33. The good and the bad
More precise and standardised | Ambiguous
Metres / seconds | Furlongs and fortnight
2015-09-10T15:00:01+01:00 | 10th Sept. 2015 15:00:01
Longitudinal wind speed | U
PDF 1.7 | PDF
2008 US Population statistics | Population statistics
Barcelona, Venezuela | Barcelona
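The precise timestamp in this comparison follows ISO 8601, which most languages can produce directly. The sketch below uses Python's standard library to generate it; the measurement record at the end is an illustrative structure with a made-up value, not data from the presentation.

```python
from datetime import datetime, timedelta, timezone

# The ambiguous "10th Sept. 2015 15:00:01" made explicit:
# the UTC+01:00 offset is the one shown in the precise example.
tz = timezone(timedelta(hours=1))
observed_at = datetime(2015, 9, 10, 15, 0, 1, tzinfo=tz)

stamp = observed_at.isoformat()
print(stamp)

# Round trip: the string parses back to the same instant.
assert datetime.fromisoformat(stamp) == observed_at

# Keep the variable name and unit next to the value instead of a bare "U".
# The value 3.2 is illustrative only.
measurement = {"variable": "longitudinal wind speed", "value": 3.2, "unit": "m/s"}
```

Storing the offset with the timestamp removes the guesswork about time zone, and carrying the unit with the value does the same for the quantity.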
34. Digital preservation context
Main risks deal with:
• Comprehension
• Integrity
• Exploitation
• Valorization
Quality assurance
procedures to be set up for
• Metadata
• File formats
• Representation information
• Storage
• Access
• Technology watching
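Integrity, one of the risks listed above, is commonly monitored with fixity checks: a checksum is recorded when the data are deposited and recomputed later to detect silent changes. The sketch below shows the idea with SHA-256 from Python's standard library; the function names and the file used in the demonstration are hypothetical, and this is not a description of how any particular archive implements it.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_intact(path, recorded_digest):
    """True if the file still matches the digest recorded at deposit time."""
    return sha256_of(path) == recorded_digest

# Demonstration on a temporary file standing in for a deposited data set.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "results.bin"
    sample.write_bytes(b"forecast traffic")
    recorded = sha256_of(sample)               # stored alongside the metadata
    unchanged = is_intact(sample, recorded)
    sample.write_bytes(b"forecast traffiC")    # simulate silent corruption
    corrupted_detected = not is_intact(sample, recorded)

print(unchanged, corrupted_detected)
```

Reading in chunks keeps memory use constant for large files, which matters for HPC outputs.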
43. CDI Data Domain
EUDAT Data Domain modelled on the ANDS [1] Data Curation Continuum
[1] Australian National Data Service – www.ands.org.au
44. B2DROP – personal cloud
Sync and Share Research Data
b2drop.eudat.eu
An ideal solution for researchers and scientists to:
store and exchange data with colleagues and team members, including research data not finalized for publishing
share data with fine-grained access controls
synchronize multiple versions of data across different devices
Features:
20 GB storage per user
Living objects, so no PIDs
Versioning and offline use
Desktop synchronisation
45. B2SHARE - repository
Store and Publish Research Data
b2share.eudat.eu
A winning solution for researchers, scientists and communities to:
store data safely at a trusted and certified data centre
preserve data to guarantee long-term persistence
control access and share data with colleagues and the world
Features:
Metadata management
Permanent PIDs
Open Access support
46. B2SAFE - preservation
Replicate Research Data Safely
eudat.eu/b2safe
The ideal solution for communities with no facility for archival to:
replicate research data into secure data stores
archive and preserve research data in the long term
bring data close to powerful compute resources
co-locate data with different communities
benefit from economies of scale
Features:
Large-scale storage
Robust and highly available
Permanent PIDs
47. B2STAGE - transfer
Get Data to Computation
eudat.eu/b2stage
Facilitating communities to:
move large amounts of data between data stores and high-performance compute resources
re-ingest computational results back into EUDAT
deposit large data sets onto EUDAT resources for long-term preservation
Features:
High-speed transfer
Reliable and light-weight
Manages permanent PIDs
48. B2FIND - catalogue
Find Research Data
b2find.eudat.eu
A metadata catalogue service to:
seek data objects and collections using powerful metadata searches
catalogue community data by means of selected metadata
browse through multi-disciplinary data collections filtered by content, provenance and temporal keywords
Features:
Simple to use
Standards-based
Comprehensive catalogue
49. Update the dataflow diagram with EUDAT services
you could use for preservation and metadata
publication.
Exercise phase 3
50. DFD with other EUDAT services
[Figure: data flow diagram extended with EUDAT services. Labels: Simulations; PRACE storage; Output files; Extractor; Input files; EUDAT B2SHARE; B2SHARE storage; Web front end or API; Results; Traffic data; Data extractor (uses API); Publication (uses API); Actual traffic; Forecast traffic; Citizens, researchers, companies, ...; Search and retrieve data; EUDAT site; EUDAT B2SAFE storage; EUDAT B2SAFE; Data and metadata; EUDAT B2FIND; Metadata; Metadata search; Replication]
51. www.eudat.eu
Thanks – any questions?
Acknowledgements:
Thanks to Mark van de Sanden, Marjan Grootveld, Sarah Jones
and Giuseppe Fiameni for some of the slides