This document discusses challenges and solutions around research data management in Canada. It argues that the real challenges are curating metadata and ensuring long-term access to data, rather than hardware storage. It proposes establishing data stewardship facilities that can provide long-term storage, access, and curation of research data from multiple related projects in a cost-effective way. Such facilities could help address issues around scattered and inaccessible data by acting as a central portal and ensuring data stewardship beyond individual projects. Examples of existing Canadian data stewardship facilities in astronomy, polar research, and social/health statistics are provided.
1. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
BEYOND INFRASTRUCTURE
GAPS
CASRAI Canada ReConnect14
Benoît Pirenne, Director, User Engagement, Ocean Networks Canada
Ottawa, November 19, 2014
2. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
OR: HOW WILL WE SOLVE RESEARCH
DATA MANAGEMENT ISSUES IN
CANADA?
3. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
Why data management?
❖Research Data Management has recently received a lot
of attention
- Scientific research equipment and programmes are costly to set up and/or
operate, so their data must be re-used and shared with many other
users
- New insight can emerge from re-use of the data
- Too many (smaller) research programmes have no data management
plan, and their data end up being lost
4. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
DM Activities
[Diagram. Labels: Sensors, Other Digital Data; Data Acquisition; Archive; Format translation, data products; Initial Users; Other Users (≠ disciplines, public)]
5. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
Challenges of DM
❖People focus on the hardware issues:
- That’s chasing the wrong rabbit!
- [LHC’s 25 PB/yr]: “Storing the data is not a problem: hard drives are cheap
and getting cheaper. The challenge is preserving knowledge that is less
commonly stored — the software, algorithms and reference plots specific to
each experiment. These often degrade or disappear with time”, says Cristinel
Diaconu (nature.com, Nov. 26, 2013).
- Funding agencies prefer the hardware focus because hardware funding is a one-off!
6. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
Challenges of DM
❖ Real challenge: data description (metadata)
- Requires: gathering, indexing, describing and curating research data
at all stages of data collection, preparation, archival and distribution
- Metadata is essential for, and part of, data quality assessment
- Includes source, full description, calibration, annotations, space-time
info, …, ownership, access authorizations, …
- Includes the link between data and resulting publications
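As a rough illustration of the metadata items listed above, the following Python sketch models one dataset's description as a record. The class name, field names, and sample values are all hypothetical and not taken from any facility's actual schema; a real system would follow its discipline's metadata standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    source: str             # instrument or sensor that produced the data
    description: str        # full description of the dataset
    calibration: str        # calibration applied, or a pointer to it
    start_time: datetime    # space-time info: start of coverage
    end_time: datetime      # space-time info: end of coverage
    latitude: float         # space-time info: location
    longitude: float
    owner: str              # ownership
    access: str = "public"  # access authorization level
    annotations: list = field(default_factory=list)
    publications: list = field(default_factory=list)  # links to resulting papers

    def missing_fields(self):
        """Return the names of descriptive text fields that were left empty."""
        required = ("source", "description", "calibration", "owner")
        return [name for name in required if not getattr(self, name).strip()]


# Invented sample record with the calibration left blank on purpose.
record = DatasetMetadata(
    source="CTD sensor (hypothetical)",
    description="Temperature and salinity time series",
    calibration="",
    start_time=datetime(2014, 1, 1, tzinfo=timezone.utc),
    end_time=datetime(2014, 6, 30, tzinfo=timezone.utc),
    latitude=48.3,
    longitude=-126.1,
    owner="Example Lab",
)
print(record.missing_fields())  # → ['calibration']
```

A completeness check of this kind is one simple way metadata can feed into data quality assessment: a record with gaps is flagged before the data are archived.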
7. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
DM Activities
[Diagram. Labels: Sensors, Other Digital Data; Data Acquisition; Archive; Format translation, data products; Initial Users; Other Users (≠ disciplines, public); added label: Metadata]
8. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
Challenges of DM
❖ Real challenge: data description (metadata)
- Not popular with funding agencies because metadata
requires expert, dedicated staff to curate data
- Metadata requires software systems that must be maintained to
support the activity
- Metadata is a long-term commitment
9. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
Challenges of DM
❖Data access
- Search through data (not always possible), search through
metadata
- Metadata encoding and transport standards needed
- Data formats are discipline-specific
- Uniform, interoperable access is a huge challenge (e.g., the Virtual Observatory, VO, in astronomy)
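To make "search through metadata" concrete, here is a minimal Python sketch, assuming a small catalogue of metadata entries exchanged as JSON. The catalogue contents, the field names, and the `search` function are invented for illustration; JSON merely stands in for whatever encoding and transport standard a discipline agrees on.

```python
import json
from datetime import date

# Hypothetical catalogue entries; all values are invented for illustration.
CATALOGUE = [
    {"id": "ds-001", "discipline": "oceanography",
     "start": "2013-01-01", "end": "2013-12-31",
     "keywords": ["temperature", "salinity"]},
    {"id": "ds-002", "discipline": "astronomy",
     "start": "2012-03-01", "end": "2012-03-31",
     "keywords": ["imaging"]},
    {"id": "ds-003", "discipline": "oceanography",
     "start": "2014-02-01", "end": "2014-05-31",
     "keywords": ["hydrophone"]},
]


def search(catalogue, discipline=None, keyword=None, overlaps=None):
    """Return ids of entries matching every criterion that was given.

    overlaps is an optional (start, end) pair of datetime.date objects.
    """
    hits = []
    for entry in catalogue:
        if discipline and entry["discipline"] != discipline:
            continue
        if keyword and keyword not in entry["keywords"]:
            continue
        if overlaps:
            start = date.fromisoformat(entry["start"])
            end = date.fromisoformat(entry["end"])
            # Skip entries whose coverage lies entirely outside the window.
            if end < overlaps[0] or start > overlaps[1]:
                continue
        hits.append(entry["id"])
    return hits


# Round-trip through JSON to mimic metadata arriving from another facility.
decoded = json.loads(json.dumps(CATALOGUE))
result = search(decoded, discipline="oceanography",
                overlaps=(date(2013, 6, 1), date(2014, 3, 1)))
print(result)  # → ['ds-001', 'ds-003']
```

The search never opens the data files themselves, which is why it keeps working when the data formats are discipline-specific.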
10. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
Challenges of DM
- Convince PIs and funding agencies that good
Data Management is important.
- But this battle is by now almost won. (NSF, TC3+, … )
- New CFI Cyber-Infrastructure initiative to be announced to
support most needs of data stewardship
11. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
How can we afford DM?
❖Data Management is affordable
- Experience shows that, across disciplines, the average cost to
set up a DM system is ~10% of the costs of the projects it supports
- Experience shows that the burden of operating a DM system is
about 10% of the projects’ overall operating costs
- DM costs fall further once projects are no longer
operational
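The 10% rules of thumb above can be turned into a back-of-the-envelope estimate. The Python sketch below assumes the set-up figure applies to a project's capital cost and the operating figure to its annual operating cost; that reading, the function name, and the sample amounts are assumptions made for illustration, not figures from the talk.

```python
DM_SETUP_PERCENT = 10      # ~10% of project set-up cost, per the rule of thumb
DM_OPERATING_PERCENT = 10  # ~10% of project operating cost, per the rule of thumb


def estimate_dm_costs(project_setup_cost, annual_operating_cost, years):
    """Estimate DM costs for a project over its operational lifetime."""
    setup = project_setup_cost * DM_SETUP_PERCENT / 100
    annual = annual_operating_cost * DM_OPERATING_PERCENT / 100
    return {
        "setup": setup,
        "annual_operations": annual,
        "total": setup + annual * years,
    }


# Invented example: $5M to build, $800k per year to run, 5 years of operation.
costs = estimate_dm_costs(5_000_000, 800_000, 5)
print(costs)
# → {'setup': 500000.0, 'annual_operations': 80000.0, 'total': 900000.0}
```

The estimate covers only the operational phase; the slide notes that DM costs fall further afterwards, and the sketch does not model that.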
12. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
Towards Data
Stewardship facilities
❖At the service of many projects in related disciplines
❖ Provides long-term data storage, access and stewardship, well
beyond the lifetime of individual projects
❖Need is particularly acute for small projects
❖Avoid the creation of many ad-hoc systems that can’t be maintained
long-term
❖ International quality standards exist (ICSU’s World Data System)
13. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
DM Activities
[Diagram. Labels: Sensors, Other Digital Data; Data Acquisition; Archive; Format translation, data products; Initial Users; Other Users (≠ disciplines, public); added label: Data Stewardship Facility]
14. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
DSF for users
❖Address the following:
❖Too many data repositories for similar datasets
❖Poorly described results
❖Untraceable sources
❖Unreadable digital media
❖“Abandoned”, inaccessible records
❖Incomplete dataset description
15. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
DSF for users
❖Are a one-stop-shop for data in a given discipline,
and a portal to international resources
❖Allow scientists to focus on science, not on data
management
❖Ensure stewardship of data beyond project funding
❖Ensure data will remain citable
16. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
DSF for users
- Buy-in from users and PIs regarding:
- Development of trust with external entities managing their data
- The definition of a(n open) data policy, sharing of data
- Being thorough with data/experimentation description (Metadata)
- Realizing that data management is not achieved with a bit of
hardware and software
In progress: use of clouds increasing
In progress: more and more open data policies around
DSF for Funding agencies
❖Ability to achieve economies of scale
❖DSF have expertise in data management and relevant science disciplines
❖DSF have the wherewithal to remain at the leading edge of technology
❖Users are already used to entrusting their data to “the Cloud” and to working with
remote compute resources
❖With similar international peers, DSFs have a voice at the interoperability and
standards table
❖Newest CFI Cyber Infrastructure program is a step in the right direction
18. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
Challenges for DSFs
❖Have to deal with users for whom the data
volumes are unheard of!
19. DISCOVER THE OCEAN. UNDERSTAND THE PLANET.
Canadian DSF examples
❖Canadian Astronomy Data Centre (CADC) is a great example of a
discipline-specific Data Stewardship Facility
❖Canadian Polar Data Network (CPDN) — includes multi-disciplinary
data
❖Canadian Research Data Centre Network (CRDCN) (social and
population health statistics)
❖…