Data Management in the context of Open Science.
Because open access has become mandatory for publications and project-funded research data, each researcher is responsible for staying informed about the new practices and then being trained in them.
2. Daniel Jacob – INRA UMR 1332 BFP – Oct 2019 2
At the end of the course you should understand:
• Links between Research Data and Open Science
• How the management and preservation of Research Data can facilitate the work of researchers
• How to address concerns about Data Sharing
• The research Data life cycle
The Reproducibility Crisis
In recent years, evidence has emerged from disciplines ranging from biology to
economics that many scientific studies are not reproducible.
This evidence has led to declarations in both the scientific and lay press that
science is experiencing a “reproducibility crisis” and that this crisis has
significant impacts on both science and society, including misdirected effort,
funding, and policy implemented on the basis of irreproducible research.
Franklin Sayre, Amy Riegelman (2018) C&RL 79(1) https://doi.org/10.5860/crl.79.1.2
p-hacking
p-hacking (also data dredging, data fishing, data snooping, …) is the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect.
This is done by performing many statistical tests on the data and only paying attention to those that come back with significant results, instead of stating a single hypothesis about an underlying effect before the analysis and then conducting a single test for it.
https://en.wikipedia.org/wiki/Data_dredging
This phenomenon appears, for example, in medicine, more precisely in epidemiology, where, based on a large number of variables (weight, age at first cigarette, etc.) and a large number of possible outcomes (breast cancer, lung cancer, car accident, etc.), chance associations are made (a posteriori) and statistically "validated".
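The mechanism can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not an analysis of real data: every variable is pure random noise, the sample size, number of variables and seed are arbitrary, and the p-value is approximated with the Fisher z-transform rather than an exact test. Testing many unrelated variables against one outcome typically yields a handful of "significant" results by chance alone, and a Bonferroni correction is one simple way to guard against this.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r, using the Fisher z normal approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def dredge(n_subjects=200, n_variables=100, alpha=0.05, seed=1):
    """Test many noise variables against one noise outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    p_values = []
    for _ in range(n_variables):
        predictor = [rng.gauss(0, 1) for _ in range(n_subjects)]
        r = pearson_r(predictor, outcome)
        p_values.append(approx_p_value(r, n_subjects))
    naive = sum(p < alpha for p in p_values)                     # no correction
    corrected = sum(p < alpha / n_variables for p in p_values)   # Bonferroni
    return p_values, naive, corrected


p_values, naive_hits, corrected_hits = dredge()
print(f"'significant' at 0.05 without correction: {naive_hits} of {len(p_values)}")
print(f"still significant after Bonferroni correction: {corrected_hits}")
```

Since no variable has any real effect here, every uncorrected hit is a false positive. Stating one hypothesis in advance, or correcting for the number of tests, avoids reporting them as findings.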
Cholesterol controversy
Correlation does not imply causation!
Plot of death rate from coronary heart disease (1977) correlated with daily dietary intake (from 1976 to 1978) of cholesterol and saturated fat as expressed by the cholesterol-saturated fat index (CSI) per 1000 kcal.
Jean Ferrières, The French paradox: lessons for other countries. Heart. 2004 Jan; 90(1): 107–111. doi: 10.1136/heart.90.1.107
Jeanne Garbarino, Cholesterol and Controversy: Past, Present and Future. Scientific American - Blog, November 15, 2011.
https://blogs.scientificamerican.com/guest-blog/cholesterol-confusion-and-why-we-should-rethink-our-approach-to-statin-therapy/
Open Science
During a research project
[Diagram: a Research Project with Input and Output; labels: DATA, Studies, Know-how, Knowledge]
After the project is completed
What becomes of them?
Among the possible scenarios, two are extreme:
• Nothing! They sit on a disk (until the disk dies!)
• Creation of a comprehensive database managing all data and metadata, associated with a visualization and querying interface.
Expected objectives
[Diagram labels: Research Project, DATA, Studies]
Expected objectives
[Diagram: expected links between the Research Project (DATA, Studies) and Scientific Data Repositories, Enrichment, Publishing policies, …]
https://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-dissemination_en.htm
NATIONAL PLAN FOR OPEN SCIENCE
Announced by Frédérique Vidal on 4 July 2018, the plan makes open access mandatory for publications and project-funded research data.
• Open science is the practice of making research publications and data freely available (transparency)
• Open science seeks to create an ecosystem in which scientific research is more cumulative (interdisciplinary)
• Open science makes knowledge accessible to all (civic aspect)
• Open science also drives scientific progress (reactivity)
• Finally, open science fosters scientific integrity and people’s trust in science (ethics)
http://cache.media.enseignementsup-recherche.gouv.fr/file/Recherche/50/1/SO_A4_2018_EN_01_leger_982501.pdf
Open Science is a new research paradigm facing many challenges, mainly:
• the requirement of many skills
• ingrained research habits
[Diagram: Data Science as an interdisciplinary field combining Scientific Field, IT Skills and Statistics; labels: Data Management, Data Interpretation, Data Analysis, Software, Data]
Science today - context
Research Paradigms (knowledge creation):
• Experimental science
• Theoretical science
• Data-intensive science / Data-driven science (New Paradigm)
The new paradigm requires three skills:
• Scientific field
• Information management
• Data processing
Not only induction and deduction, but above all abduction >> data science
What are the consequences for the data?
Publications + Data
Abduction
Abduction is a type of reasoning that consists in inferring the probable causes of an observed fact.
In other words, it is a matter of establishing the most probable cause of an observed fact …
… and stating, as a hypothesis, that the fact in question probably results from that cause.
Data Science
Data-driven science
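One common way to formalise "the most probable cause" is Bayes' rule. The slides do not prescribe it, so the following is only an illustrative sketch. The candidate causes, prior probabilities and likelihoods are invented numbers, and `most_probable_cause` is a hypothetical helper name.

```python
def most_probable_cause(priors, likelihoods):
    """Rank candidate causes of one observed fact by posterior probability.

    priors[c]      : P(cause c) before looking at the observation
    likelihoods[c] : P(observation | cause c)
    """
    joint = {cause: priors[cause] * likelihoods[cause] for cause in priors}
    evidence = sum(joint.values())  # P(observation), the normalising constant
    posterior = {cause: value / evidence for cause, value in joint.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# Hypothetical observed fact: a plant shows yellow leaves.
priors = {"nitrogen deficiency": 0.30, "overwatering": 0.50, "virus": 0.20}
likelihoods = {"nitrogen deficiency": 0.80, "overwatering": 0.30, "virus": 0.40}

best, posterior = most_probable_cause(priors, likelihoods)
print("hypothesis to test next:", best)
for cause, probability in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"  {cause}: {probability:.2f}")
```

The result is a hypothesis, not a conclusion: abduction proposes the cause to investigate, and further data are still needed to confirm or reject it.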
What is Research Data?
Data from observation, experimentation or derived from existing sources that are analyzed in order to produce or validate original research results.
Examples: digital data tables, text files, sound recordings, completed survey questionnaires, image or video databases, derived or compiled data.
“Data, or units of information, related to research activities, whether funded or not, are often organized or formatted in such a way that they can be communicated, interpreted and processed. Research Data are all the information you use as part of your research” according to the University of Bristol
Why do we have to "manage" Research Data based on the Open Science paradigm?
“Data management should be woven into every course in science.”
Data's shameful neglect, Nature 461, 2009 (Editorial)
https://www.nature.com/articles/461145a
RDM benefits
Research data management:
• orchestrates data for efficient and reliable use
• increases the impact of research
• improves the visibility of research
• allows data to be shared securely
• makes it easy to find the data
• reduces the risk of data loss
• increases citation rates
• is a requirement of most funders and publishers
Data Management Facilitates Sharing and Re-use …
Manage?... but manage what?
• Primary/secondary
• Experimental, observational, simulation, derived, compiled, canonical
• Raw, processed, aggregated, enriched, annotated, formatted, standardized, published
• Structured/unstructured, homogeneous/heterogeneous
• Free / protected
The data life cycle
Integrate scientific data management into research activities
• Data creation: collection (experiments, measurements, observations, simulations), creation of metadata
• Data processing: enter, format, clean, organize, verify, validate, describe, store
• Data analysis: interpretation, visualization, formatting, publication
• Data preservation: migration, reformatting, back-up, permanent storage, metadata, documentation, certification
• Data dissemination: distribution, referencing, reporting, rights management, data journals
• Re-use: teaching, new research, evaluation
Curation of data
Research Data Management
Support - skills and professions
• Data Creator: people who produce digital data
• Data Scientist: data analysis
• Data Manager: expert on the management, reporting, storage and dissemination of research data
• IT Manager / System Administrator: «skilled partner» in data archiving and preservation
A wide variety of fields
Rapid developments - Continuing training required
New jobs require more and more IT skills
The data life cycle
The scientific data life cycle is the set of stages of management, conservation, dissemination and reuse of scientific data related to research activities.
At each stage, services can be developed:
- development of a Data Management Plan (DMP)
- identification of metadata describing the data
- selection of repositories to store data
- data retention infrastructures
- data discovery and mining tools
- data reuse framework
https://ec.europa.eu/research/openscience/pdf/os_skills_wgreport_final.pdf
Data Management Plan (DMP)
Main step of data management
Tool to be used as soon as projects are set up
A data management plan or DMP is a formal document that outlines how data will be obtained, processed, organized, stored, secured, preserved and shared, both during a research project and after the project is completed.
The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins. This ensures that data are well managed in the present and prepared for preservation in the future.
Optimization of Data Sharing and Interoperability of Research
https://dmp.opidor.fr/
https://www6.inra.fr/datapartage/
Data Management Plan (DMP)
Operational Details
Data Management Plan (DMP)
• Responsibilities in the project: What does the project consist of? Who are the partners? What is the policy on data management? Who is responsible for the management of data?
• Resources: How is the management of data funded, especially in the long term?
• Data collection: What data will be produced/used during the course of the project (type, format, volume and growth...)? How will they be produced? Processed?
• Data backup: How, where and by whom will the data be stored, backed up and secured?
• Data Documentation: How will the data be identified and described? What metadata standards will be used? How will the metadata be generated?
• Data Access and Data sharing: Who will be able to access the data? Will the data be shared? Published? With whom? How? For how long? Under which license?
• Intellectual Property: Who will own the data produced? Will external data be used?
• Data Archiving: What is the plan for long-term archiving and preservation?
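The sections above can serve as a simple completeness checklist while a DMP is being drafted. The sketch below assumes a draft stored as a plain dictionary. The section names come from the slide, while the `missing_sections` helper and the draft content are hypothetical. Tools such as DMP OPIDoR provide their own templates, and this does not reproduce them.

```python
DMP_SECTIONS = [
    "Responsibilities in the project",
    "Resources",
    "Data collection",
    "Data backup",
    "Data Documentation",
    "Data Access and Data sharing",
    "Intellectual Property",
    "Data Archiving",
]


def missing_sections(draft):
    """Return the DMP sections that are absent or left empty in a draft."""
    return [name for name in DMP_SECTIONS if not draft.get(name, "").strip()]


# Hypothetical draft, started when the project was set up.
draft = {
    "Responsibilities in the project": "PI: J. Doe; data manager: to be named",
    "Data collection": "1H-NMR spectra, about 2 GB per year",
    "Data backup": "",  # present but still empty
}

for name in missing_sections(draft):
    print("still to write:", name)
```

Running such a check at each project milestone keeps the DMP a living document rather than a one-off deliverable.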
RDM based on Open Science: THE FAIR DATA PRINCIPLES
The FAIR Data Principles are a set of guiding principles to make data findable, accessible, interoperable and reusable (Wilkinson et al., 2016, Scientific Data - https://www.nature.com/articles/sdata201618).
https://www.force11.org/group/fairgroup/fairprinciples
• Findable: describe your data in a data repository; apply a persistent identifier
• Accessible: consider what will be shared; obtain participant consent
• Interoperable: use open formats; consistent vocabulary; common metadata standards
• Reusable: consider permitted use; apply appropriate license
THE FAIR DATA PRINCIPLES
A1.2 => Open as much as possible, Close as much as necessary
THE FAIR DATA PRINCIPLES
5 ★ OPEN DATA
THE FAIR DATA PRINCIPLES
It is above all an approach to measuring the maturity of your data in relation to Open Data.
From Principles towards Implementations
The Internet of FAIR Data & Services
https://www.go-fair.org/
DMP model H2020 based on FAIR principles
https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
Guidelines on FAIR Data Management in Horizon 2020
1. Data Summary
2. FAIR data
2.1. Making data findable, including provisions for metadata
2.2. Making data openly accessible
2.3. Making data interoperable
2.4. Increase data re-use (through clarifying licences)
3. Allocation of resources
4. Data security
5. Ethical aspects
6. Other issues
7. Further support in developing your DMP
5 ★ OPEN DATA
Publish data "5 Gold stars"
Tim Berners-Lee, the inventor of the Web and Linked Data initiator, suggested a 5-star deployment scheme for Open Data:
★ Data on the web, open license
★★ … in a structured format
★★★ … and non-proprietary format
★★★★ … identified by URIs
★★★★★ … and related to others (data)
See also
K. Janowicz et al (2014) Five Stars of Linked Data Vocabulary Use, Semantic Web 0 (2014) 1–0
https://geog.ucsb.edu/~jano/swj653.pdf
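Because each star builds on the previous one, the scheme can be read as a cumulative checklist. The sketch below is one possible encoding of that reading. The `Dataset` fields and the `star_rating` function are hypothetical names chosen for illustration, and the two example datasets are invented.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    on_web_open_license: bool      # 1 star
    structured: bool               # 2 stars
    non_proprietary_format: bool   # 3 stars
    uses_uris: bool                # 4 stars
    linked_to_other_data: bool     # 5 stars


def star_rating(ds):
    """Count stars cumulatively: a level only counts if all lower levels hold."""
    criteria = [
        ds.on_web_open_license,
        ds.structured,
        ds.non_proprietary_format,
        ds.uses_uris,
        ds.linked_to_other_data,
    ]
    stars = 0
    for satisfied in criteria:
        if not satisfied:
            break
        stars += 1
    return stars


# A spreadsheet in a proprietary format, published online under an open license.
spreadsheet = Dataset(True, True, False, False, False)
# The same table exported as CSV.
csv_export = Dataset(True, True, True, False, False)

print(star_rating(spreadsheet), star_rating(csv_export))
```

A rating like this is a rough maturity indicator. It says nothing about the quality of the metadata or of the data themselves.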
Research Data Repositories are based on web applications to preserve, share, cite, search and analyse research data.
https://data.inra.fr/
…
SERVICE DESCRIPTION
re3data is a global registry of research data repositories from a diverse range of academic disciplines. It provides information on repositories for the permanent storage and access of data sets to researchers, funding bodies, publishers and scholarly institutions.
Recommended Data Repositories
https://www.nature.com/sdata/policies/repositories
https://fairsharing.org/databases/
Science Europe’s Framework for Discipline-specific Research Data Management
…
2,406 Data Repositories (Oct 10, 2019)
https://www.re3data.org/metrics
Not FAIR !!
FAIR ?
Reproducible Research
in the context of Open Science
Calls for Open Science & Reproducible Research
Typical examples of where problems can arise
• Issues often arise when users jump straight into software implementations of methods (e.g. in R) that may lack documentation on the biases and assumptions mentioned in the original papers.
• A major and often overlooked cause of lack of repeatability is the wide sample-to-sample variability in the P value. Because the P value is fickle, the interpretation of analyses should not be based predominantly on this statistic.
Halsey et al (2015) The fickle P value generates irreproducible results, Nature Methods 12, 179–185
• Overfitting is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. In regression analysis, overfitting can produce misleading R-squared values, regression coefficients, and p-values.
https://statisticsbyjim.com/regression/overfitting-regression-models/
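Overfitting can be seen in a small numerical experiment. The sketch below is an illustration under assumptions, not taken from the cited sources: the true relationship is assumed linear, the noise level, sample size and seed are arbitrary, and the "too complex" model is a polynomial that passes through every training point. It compares each model's error on the training data with its error on new data from the same process.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return lambda x: a + b * x


def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree n-1 through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared prediction error."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(42)


def noisy_truth(x):
    return 2.0 * x + 1.0 + rng.gauss(0, 0.5)  # linear signal plus random error


train_x = [i / 11 for i in range(12)]
test_x = [(i + 0.5) / 11 for i in range(11)]  # new points between the old ones
train_y = [noisy_truth(x) for x in train_x]
test_y = [noisy_truth(x) for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
poly_train, poly_test = mse(poly, train_x, train_y), mse(poly, test_x, test_y)
print(f"line:       train MSE {line_train:.3f}, test MSE {line_test:.3f}")
print(f"polynomial: train MSE {poly_train:.3f}, test MSE {poly_test:.3f}")
```

The polynomial reproduces the training data perfectly because it has fitted the noise, which is why its error on new data is far worse than that of the simple line.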
Calls for Open Science & Reproducible Research
Other issues
• Loss of data and/or information: not regularly backing up your data is considered professional negligence
• Lack of knowledge, lack of technical skills, more or less haphazard practices: training is a right, but also a duty for anyone who claims to fully assume a function / mission
• Miscellaneous:
  - Continuous evolution of software libraries & their dependencies
  - Problems related to numerical accuracy from one computer to another
  - Versioning
  - …
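One reason numerical results can differ between machines or library versions is that floating-point addition is not associative, so any change in the order of operations (a different compiler, a parallel sum, a new library release) can change the last digits. The sketch below shows the effect with plain IEEE 754 double-precision arithmetic. It illustrates this single mechanism only and does not cover every source of cross-platform discrepancy.

```python
import math

# Same three numbers, two groupings, two different results.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)              # the two sums differ in the last digit
print(math.isclose(a, b))  # compare with a tolerance instead of ==

# With very different magnitudes the loss can be total.
big_first = (1e16 + 1.0) - 1e16    # the 1.0 is absorbed by the large value
small_last = (1e16 - 1e16) + 1.0   # the 1.0 survives
exact = math.fsum([1e16, 1.0, -1e16])  # correctly rounded sum
print(big_first, small_last, exact)
```

In practice this means comparing results with a tolerance, recording software versions, and not expecting bit-identical output across environments unless the whole computational environment is fixed.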
Calls for Open Science & Reproducible Research
What Science Requires
“Citations to unpublished data and personal communications cannot be used to support claims in a published paper”
“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”
Reproducible Research
Research is defined as reproducible when the published results can be replicated using the documented data, code, and methods employed by the author or provider, without the need for any additional information or needing to communicate with the author or provider.
https://nnlm.gov/data/thesaurus/reproducible-research
Reproducible research:
• is not a guarantee of research quality, but a guarantee of transparency
• contributes to quality but does not replace it
Reproducible Research
Reproducibility has the potential to serve as a minimum standard for judging scientific claims when full independent replication of a study is not possible.
Reproducible Research
Good practices
Data Collection and Management :
Write an information collection protocol: this protocol should be part of the published article
Maintain a laboratory notebook
Collect data repeatedly AND reproducibly
Research Compendium :
facilitates reproducible research by bringing together in a single virtual "place" the data, code, protocols and documentation related to a research project.
It holds the full computational environment used to produce the results in the paper (code, data, etc.), which can be used to reproduce the results and create new work based on the research.
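A compendium can start as nothing more than an agreed folder layout. The sketch below creates one such layout. The directory and file names are a hypothetical convention chosen for illustration, not a standard imposed by the course or by any tool.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: one place for data, code, protocols and documentation.
LAYOUT = {
    "README.md": "What the project is and how to rerun the analysis.\n",
    "data/raw/.gitkeep": "",
    "data/processed/.gitkeep": "",
    "code/analysis.py": "# analysis script goes here\n",
    "protocols/.gitkeep": "",
    "docs/.gitkeep": "",
    "environment.txt": "# pinned versions of the software used\n",
}


def create_compendium(root):
    """Create the skeleton under root and return the files it contains."""
    root = Path(root)
    for relative, content in LAYOUT.items():
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():  # never overwrite existing work
            target.write_text(content, encoding="utf-8")
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Demonstration in a throw-away directory.
with tempfile.TemporaryDirectory() as tmp:
    created = create_compendium(Path(tmp) / "my_project")
    print("\n".join(created))
```

Keeping raw data separate from processed data, and both separate from code, makes it easier to trace how each result was obtained.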
Reproducible Research
Good practices
Manage what? What kind of data/information?
The minimal but mandatory set of files, from RAW DATA to final results, including:
• Standard Operating Procedures (SOP)
• Data reporting
[Diagram: Raw Data to Processed data, with Checking, Validation and Tracing]
Reproducible Research
Good practices
Manage what? What kind of data/information?
The minimal but mandatory set of files
Example: 1H-NMR Analytical Technique
Example of full 1H-NMR data set: http://nmrprocflow.org/ex1
• The raw NMR spectra (ZIP file)
• An image of an annotated NMR spectrum
• The compound attribution zones
• The calibration file (calibration curves based on standard compounds)
• The Excel worksheet(s) used to calculate the quantification
• The final quantification results file
• Protocol documents that describe each step of the process (Quality Assurance):
  I. Analytical sample preparation
  II. Analytical processing
  III. Data processing
  IV. Quantification
Checking, Validation, Tracing
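The role of the calibration file can be made concrete with a generic linear calibration. The source does not describe the exact procedure behind the example data set, so this sketch assumes a straight-line relationship between peak area and concentration. The standard concentrations, peak areas and function names are invented for illustration.

```python
def fit_calibration(concentrations, areas):
    """Least-squares line: area = slope * concentration + intercept."""
    n = len(concentrations)
    mc = sum(concentrations) / n
    ma = sum(areas) / n
    slope = sum(
        (c - mc) * (a - ma) for c, a in zip(concentrations, areas)
    ) / sum((c - mc) ** 2 for c in concentrations)
    intercept = ma - slope * mc
    return slope, intercept


def quantify(area, slope, intercept):
    """Invert the calibration line to get a concentration from a peak area."""
    return (area - intercept) / slope


# Hypothetical standards of one compound (concentration in mM, area in a.u.).
standards_mM = [0.5, 1.0, 2.0, 4.0]
peak_areas = [1.1, 2.1, 4.1, 8.1]

slope, intercept = fit_calibration(standards_mM, peak_areas)
sample_mM = quantify(5.1, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, sample={sample_mM:.3f} mM")
```

Archiving the calibration data together with the raw spectra and the results file lets anyone recompute the quantification and check it against the published values.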
Reproducible Research
Good practices
Backups :
Not regularly backing up your data is considered professional negligence
Versions and Archives :
Safeguarding the successive stages of document development (texts, data, code, etc.) is one of the fundamental building blocks of reproducible research
Implementation of a version management strategy:
• Git + local or institutional Forge (e.g. Forgemia), GitHub (e.g. github/INRA)
• Research data repositories (re3data.org)
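A backup is only useful if the copy is intact, so a copy step is often paired with a checksum comparison. The sketch below shows that idea with the Python standard library. The file name and directory layout are hypothetical, and it does not replace an institutional backup service or a version control system.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 checksum of a file, read in chunks to handle large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup_dir):
    """Copy a file into backup_dir and check the copy against the original."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = backup_dir / Path(source).name
    shutil.copy2(source, copy)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(copy):
        raise IOError(f"backup of {source} does not match the original")
    return copy


# Demonstration with a throw-away file standing in for raw data.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "raw_spectra.zip"
    original.write_bytes(b"hypothetical raw data")
    copy = backup_and_verify(original, Path(tmp) / "backup")
    verified = sha256_of(copy) == sha256_of(original)
    print(copy.name, "verified:", verified)
```

Storing the checksum next to the archived file also allows the integrity of the data to be re-checked years later.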
Reproducible Research
Good advice
Data exploration
Use tools that you know well or that make you more efficient.
But learn to program:
• Limit the use of graphical user interfaces (GUI) for subtle or repetitive tasks
• Be able to express in a clear, documented and unambiguous way what you want the software to do
• A program can often be expressed in just a few lines. The higher the level of the language used, the less there is to write.
Typical examples of reproducible research comprise compendia of data, code and text files, often organised around an R Markdown source document or a Jupyter notebook.
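As an illustration of how little code a documented, repeatable task can take, the sketch below computes per-group means from a small table and records the Python version used. The data are invented and embedded in the script only to keep the example self-contained. In a real compendium they would be read from a file under version control.

```python
import csv
import io
import platform
import statistics
from collections import defaultdict

# Hypothetical measurements; normally read from a file in the compendium.
RAW = """sample,group,value
s1,control,4.1
s2,control,3.9
s3,treated,5.2
s4,treated,5.6
"""


def group_means(csv_text):
    """Mean of the 'value' column for each level of the 'group' column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["group"]].append(float(row["value"]))
    return {name: statistics.mean(values) for name, values in groups.items()}


means = group_means(RAW)
for name, value in means.items():
    print(f"{name}: {value:.2f}")
print("computed with Python", platform.python_version())
```

Unlike a sequence of clicks in a GUI, this script states exactly what was done and can be rerun unchanged when the data are updated.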
Open Data for Access and Mining
ODAM Framework
Example of a Data Management System in the context of Open Science
http://pmb-bordeaux.fr/dataexplorer/
http://pmb-bordeaux.fr/odam/FAIR_and_DataLife_DJ_Oct2019.pdf
https://nbviewer.jupyter.org/github/djacob65/binder_odam/blob/master/PyODAM_api_PCA.ipynb
Some useful links related to Open Science / Data Management
• Research Data - Digital Learning: https://doranum.fr/
• CoopIST – Cooperate in Scientific and Technical Information: https://coop-ist.cirad.fr/gerer-des-donnees
• INRA services and resources: https://www6.inra.fr/datapartage
• The future of science is Open: https://www.fosteropenscience.eu/
• Building the social and technical bridges to enable open sharing and re-use of data: https://www.rd-alliance.org/
• 23 Things: Libraries for Research Data
Books online related to Reproducible Research
• Vers une recherche reproductible : Faire évoluer ses pratiques (Towards reproducible research: changing one's practices)
https://hal.archives-ouvertes.fr/hal-02144142v1
• Reproducible Research with R and RStudio, Second Edition
https://englianhu.files.wordpress.com/2016/01/reproducible-research-with-r-and-studio-2nd-edition.pdf
• Reproducibility and Replicability in Science
https://www.nap.edu/catalog/25303/reproducibility-and-replicability-in-science