Presentation at 3rd LEARN workshop on Research Data Management, “Make research data management policies work”
Helsinki, 28 June 2016, by Sarah Callaghan, STFC Rutherford Appleton Laboratory
Open Data in a Big Data World: easy to say, but hard to do?
1.
Open Data in a Big Data World: easy to
say, but hard to do?
Sarah Callaghan
sarah.callaghan@stfc.ac.uk
@sorcha_ni
ORCID: 0000-0002-0517-1031
Geoffrey Boulton, Dominique Babini, Simon Hodson, Jianhui Li, Tshilidzi
Marwala, Maria Musoke, Paul Uhlir, Sally Wyatt
3rd LEARN workshop on Research Data Management,
“Make research data management policies work”
Helsinki, 28 June 2016
3.
The Data Deluge
http://www.economist.com/node/21521549
http://www.leadformix.com/blog/2013/02/the-big-data-deluge/
4.
It used to be “easy”…
Suber cells and mimosa
leaves. Robert Hooke, Micrographia,
1665
The Scientific Papers of William Parsons,
Third Earl of Rosse 1800-1867
…but datasets have gotten so big, it’s not useful
to publish them in hard copy anymore
5.
Hard copy of the Human Genome at the
Wellcome Collection
6.
Example Big Data: CMIP5
CMIP5: Fifth Coupled Model
Intercomparison Project
• Global community activity under the
World Meteorological Organisation (WMO)
via the World Climate Research
Programme (WCRP)
•Aim:
– to address outstanding scientific
questions that arose as part of the
4th Assessment Report process,
– improve understanding of climate,
and
– to provide estimates of future
climate change that will be useful to
those considering its possible
consequences.
Many distinct experiments, with very
different characteristics, which influence the
configuration of the models, (what they can
do, and how they should be interpreted).
7.
Simulations:
~ 90,000 years
~ 60 experiments
~ 20 modelling centres (from around the world)
using
~ 30 major(*) model configurations
~ 2 million output “atomic” datasets
~ 10s of petabytes of output
~ 2 petabytes of CMIP5 requested output
~ 1 petabyte of CMIP5 “replicated” output
Which are replicated at a number of sites
(including ours)
Major international collaboration!
Funded by EU FP7 projects (IS-ENES2,
Metafor) and US (ESG) and other national
sources (e.g. NERC for the UK)
CMIP5 numbers
8.
Summary of the CMIP5 example
The Climate problem needs:
– Major physical e-infrastructure (networks, supercomputers)
– Comprehensive information architectures covering the whole information life
cycle, including annotation (particularly of quality)
… and hard work populating these information objects, particularly with
provenance detail.
– Sophisticated tools to produce and consume the data and information
objects
– State of the art access control techniques
Major distributed systems are social challenges as much as technical challenges.
CMIP5 is Big Data, with lots of different participants and lots of different
technologies.
It also has a community willing to work together to standardise and automate data
and metadata production and curation, and with the willingness to support the
effort needed for openness.
9.
Big Data:
•Industrialised and standardised data
and metadata production
•Large groups of people involved
•Methods for making the data open,
attribution and credit for data creation
established
Long Tail Data:
•Bespoke data and metadata creation
methods
•Small groups/lone researchers
•No generally accepted methods for
attribution and credit for data creation.
Often data is closed due to lack of effort
to open it
https://flic.kr/p/g1EHPR
10.
Most people have an idea of what a
publication is
11.
Some examples of data (just from the
Earth Sciences)
1. Time series, some still being updated
e.g. meteorological measurements
2. Large 4D synthesised datasets, e.g.
Climate, Oceanographic, Hydrological
and Numerical Weather Prediction
model data generated on a
supercomputer
3. 2D scans e.g. satellite data, weather
radar data
4. 2D snapshots, e.g. cloud camera
5. Traces through a changing medium,
e.g. radiosonde launches, aircraft
flights, ocean salinity and temperature
6. Datasets consisting of data from
multiple instruments as part of the
same measurement campaign
7. Physical samples, e.g. fossils
13.
Data, Reproducibility and Science
Science should be reproducible –
other people doing the same
experiments in the same way should
get the same results.
Observational data is not
reproducible (unless you have a time
machine)
Therefore we need to have access to
the data to confirm the science is
valid!
Poor data analysis generates false
facts – and false facts &
inaccessible data undermine
science & its credibility
http://www.flickr.com/photos/31333486@N00/1893012324/sizes/o/in/photostream/
14.
A crisis of reproducibility and
credibility?
The data providing the evidence for a published concept MUST be concurrently
published, together with the metadata. To do otherwise is scientific MALPRACTICE
Pre-clinical oncology – 89% not reproducible
Why?
•Misconduct/fraud
•Invalid reasoning
•Absent or inadequate data and/or metadata
15.
We’re only going to get more data
More big data - linked data – machine learning
The internet of things
So, what must we do?
•Concurrently publish data and metadata that are the evidence for a published
scientific claim – to do otherwise is malpractice
•Data science skills for researchers
•Re-establish standards of reproducibility for a data-intensive age
16.
• Patterns not hitherto seen
• Unsuspected relationships
• Integrated analysis of diverse data (e.g. natural & social science)
• Complex systems
e.g. complexity: dynamic evolution and system state
But not all research is or needs to be data-intensive
Scientific Opportunities of Big Data
https://www.clickz.com/clickz/column/2389218/create-better-content-via-humor
18.
The Open Data Edifice
Pillars of the Digital Revolution:
• Big Data: Volume, Velocity, Variety, Veracity
• Linked Data: Many databases, Semantic Relations, Deeper meaning
• Machine analysis & learning
Foundations: Openness
• Data supporting a published claim
• Other data for re-use & integration
19.
Open Data initiatives in areas of:
Life sciences
Earth Science,
Environmental Science
Food Science
Agricultural Science
Chemical Crystallography
Bioinformatics/Genomics
Linguistics
Social Sciences
Evolutionary biology
Biodiversity
Astronomy
Earth Observation (GEO)
Archaeology
Atmospheric sciences
It is happening: bottom-up Open Data initiatives
EMBL-EBI services (a collaborative enterprise; Elixir programme)
Labs around the world send us their data and we…
• Archive it
• Classify it
• Share it with other data providers
• Analyse, add value and integrate it
• …provide tools to help researchers use it
20.
The Open Data Iceberg
The Technical Challenge
The Consent Challenge
The Institutional Challenge
The Funding Challenge
The Support Challenge
The Skills Challenge
The Incentives Challenge
The Mindset Challenge
Processes &
Organisation
People
Developed from: Deetjen, U., E. T. Meyer and R. Schroeder
(2015). OECD Digital Economy Papers, No. 246, OECD
A National Infrastructure
Technology
21.
Scientists
i.Publicly funded scientists have a responsibility to contribute to the
public good through the creation and communication of new
knowledge, of which associated data are intrinsic parts. They should
make such data openly available to others as soon as possible after
their production in ways that permit them to be re-used and re-purposed.
ii. The data that provide evidence for published scientific claims
should be made concurrently and publicly available in an
intelligently open form. This should permit the logic of the link
between data and claim to be rigorously scrutinised and the
validity of the data to be tested by replication of experiments or
observations. To the extent possible, data should be deposited in
well-managed and trusted repositories with low access barriers.
From the Accord: Responsibilities
22.
Creating a dataset is hard work!
"Piled Higher and Deeper" by Jorge Cham
www.phdcomics.com
Documenting a dataset so that it is usable and understandable by
others is extra work!
23.
“I’m all for the free sharing
of information, provided
it’s them sharing their
information with us.”
http://discworld.wikia.com/wiki/Mustrum_Ridcully
Mustrum Ridcully, D.Thau., D.M., D.S.,
D.Mn., D.G., D.D., D.C.L., D.M. Phil.,
D.M.S., D.C.M., D.W., B.El.L,
Archchancellor, Unseen University, Ankh-Morpork, Discworld
- As quoted in “Unseen Academicals”, by
Terry Pratchett
24.
Open is not enough!
“When required to make the data available by
my program manager, my collaborators, and
ultimately by law, I will grudgingly do so by
placing the raw data on an FTP site, named
with UUIDs like 4e283d36-61c4-11df-9a26-
edddf420622d. I will under no circumstances
make any attempt to provide analysis source
code, documentation for formats, or any
metadata with the raw data. When requested
(and ONLY when requested), I will provide an
Excel spreadsheet linking the names to data
sets with published results. This spreadsheet
will likely be wrong -- but since no one will be
able to analyze the data, that won't matter.”
- http://ivory.idyll.org/blog/data-management.html
https://flic.kr/p/awnCQu
25.
Incentives for Open Data
• Need reward
structures and
incentives for
researchers to
encourage them to
make their data open
• Data citation and
publication
• (again, issues with
treating data as a
special case of
publications…)
27.
What the data set looks
like on disk
What the raw data files look like.
I could make these files open
easily, but no one would have
a clue how to use them!
The
Understandability
Challenge: Data
28.
It’s ok, I’ll just put it out there and if it’s
important other people will figure it out
These documents have been preserved for thousands of years!
But they’ve both been translated many times, with different meanings each time.
We need Metadata to preserve Information
We can’t rely on Data Archaeology
Phaistos Disk, 1700BC
30.
It’s not just data!
• Experimental protocols
• Workflows
• Software code
• Metadata
• Things that went wrong!
• …
31.
Usability, trust, metadata
http://trollcats.com/2009/11/im-your-friend-and-i-only-want-whats-best-for-you-trollcat/
When you read a journal paper, it’s easy to
read and get a quick understanding of the
quality of the paper.
You don’t want to be downloading many
GB of dataset to open it and see if it’s any
use to you.
Need to use proxies for quality:
•Do you know the data source/repository?
Can you trust it?
•Is there enough metadata so that you can
understand and/or use the data?
In the same way that not all journal
publishers are created equal, not all data
repositories are created equal
Example metadata from a published
dataset:
“rain.csv contains rainfall in mm for each
month at Marysville, Victoria from
January 1995 to February 2009”
Lindenmayer, David B.; Wood, Jeff; McBurney, Lachlan;
Michael, Damian; Crane, Mason; MacGregor, Christopher;
Montague-Drake, Rebecca; Gibbons, Philip; Banks, Sam C.;
(2011): rain; Dryad Digital Repository.
http://doi.org/10.5061/DRYAD.QP1F6H0S/3
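A minimal sketch of the "is there enough metadata?" proxy, using the rain.csv description above. The field names and the list of required fields are illustrative assumptions, not a Dryad or DataCite schema; only the values come from the slide.

```python
import json

# Hypothetical list of fields a reader needs before downloading anything.
# These names are assumptions for illustration, not a repository standard.
REQUIRED_FIELDS = (
    "title",
    "description",
    "units",
    "location",
    "coverage_start",
    "coverage_end",
    "doi",
)

# Values taken from the rain.csv example metadata and its citation.
rain_metadata = {
    "title": "rain",
    "description": "Rainfall for each month at Marysville, Victoria",
    "units": "mm",
    "location": "Marysville, Victoria",
    "coverage_start": "1995-01",
    "coverage_end": "2009-02",
    "doi": "10.5061/DRYAD.QP1F6H0S/3",
}


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in required if not str(record.get(name, "")).strip()]


def months_covered(start, end):
    """Count the months from start to end inclusive, given 'YYYY-MM' strings."""
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


if __name__ == "__main__":
    print(json.dumps(rain_metadata, indent=2))
    print("Missing fields:", missing_fields(rain_metadata))
    print(
        "Months covered:",
        months_covered(
            rain_metadata["coverage_start"], rain_metadata["coverage_end"]
        ),
    )
```

A check like this lets a potential user judge a dataset from a few hundred bytes of metadata rather than from many GB of data. It says nothing about whether the values themselves are correct.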
32.
Should ALL data be open?
Most data produced through
publicly funded research
should be open.
But!
• Confidentiality issues (e.g.
named persons’ health records)
• Conservation issues (e.g. maps
of locations of rare animals at
risk from poachers)
• Security issues (e.g. data and
methodologies for building
biological weapons)
There should be a very good
reason for publicly funded
data to not be open.
33.
Getting scooped
http://www.phdcomics.com/comics/archive.php?comicid=795
It happened to me!
I shared my data with another research group. They published
the first results using that data.
I wasn’t a co-author. I didn’t get an acknowledgement.
34.
Citeable does not equal Open!
Just like you can cite a paper that is
behind a paywall, you can cite a
dataset that isn’t open.
Making something citeable means
that:
• You know it exists
• You know who’s responsible for it
• You know where to find it
• You know a little bit about it (title,
abstract,…)
Even if you can’t download/read the
thing yourself.
Citation gives benefits that
encourage data producers to
make their data open
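The point that citeable and open are separate properties can be sketched in code. The record below, its field names, and its DOI are hypothetical; the citation layout loosely follows the Dryad example earlier in the talk and is not an official citation style.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal record: enough to cite, whether or not it is open."""

    authors: List[str]  # who is responsible for it
    year: int
    title: str          # a little bit about it
    repository: str     # where to find it
    doi: str            # proof that it exists, and a stable pointer to it
    is_open: bool       # access status, independent of citeability


def format_citation(record: DatasetRecord) -> str:
    """Build a citation string; note that is_open plays no part in it."""
    authors = "; ".join(record.authors)
    return (
        f"{authors} ({record.year}): {record.title}; "
        f"{record.repository}. https://doi.org/{record.doi}"
    )


# A made-up, closed dataset that can still be cited.
closed_dataset = DatasetRecord(
    authors=["Researcher, A.", "Colleague, B."],
    year=2016,
    title="Example restricted-access measurements",
    repository="Example Data Centre",
    doi="10.1234/example",  # hypothetical DOI, for illustration only
    is_open=False,
)

if __name__ == "__main__":
    print(format_citation(closed_dataset))
    print("Open for download:", closed_dataset.is_open)
```

The citation is built without ever looking at `is_open`, just as a paper behind a paywall can be cited without being readable.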
36.
Inputs Outputs
Open access
Administrative
data (held by
public
authorities e.g.
prescription
data)
Public Sector
Research data
(e.g. Met
Office weather
data)
Research
Data (e.g.
CERN,
generated in
universities)
Research
publications
(i.e. papers
in journals)
Open data
Open science
A direction of travel?
Collecting
the data
Doing
research
Doing science
openly
Researchers - Govt & Public sector - Businesses - Citizens - Citizen scientists
(communication/dialogue – joint production of knowledge)
Stakeholders
• Communication/dialogue must be audience-sensitive
• Is it – with all stakeholder groups?
37.
Summary and maybe
conclusions?
• We need to open the products of research
• to encourage innovation and collaboration
• to give credit to the people who’ve created
them
• to be transparent and trustworthy
• Openness does come at a cost!
• It’s not enough for data to be open
• it needs to be usable and understandable
too
• Data citation and publication are ways of
encouraging researchers to make their data
open
• or at least tell the world that their data exists!
• We need a culture change – but it’s
already happening!
http://www.keepcalm-o-matic.co.uk/default.asp
38.
Thanks!
Any questions?
sarah.callaghan@stfc.ac.uk
@sorcha_ni
http://citingbytes.blogspot.co.uk/
“Publishing research without data is simply
advertising, not science” - Graham Steel
http://blog.okfn.org/2013/09/03/publishing-research-without-data-is-simply-advertising-not-science/
http://heywhipple.com/dont-show-me-a-something-about-show-me-something/
Editor's notes
This is Henry Oldenburg, the first secretary of the newly formed Royal Society in the early 1660s. Henry was an inveterate correspondent with those we would now call scientists, both in Europe and beyond. Rather than keep this correspondence private, he thought it would be a good idea to publish it, and persuaded the new Society to do so by creating the Philosophical Transactions, which remains a top-flight journal to the present day. But he demanded two things of his correspondents: that they should submit in the vernacular and not Latin; and that evidence (data) that supported a concept must be published together with the concept. This permitted others to scrutinize the logic of the concept and the extent to which it was supported by the data, and it permitted replication and re-use. Open publication of concept and evidence is the basis of “scientific self-correction”, which historians of science argue was the crucial building block on which the scientific revolution of the 18th and 19th centuries was built and which remains fundamental to the progress of science. Openness to scrutiny by scientific peers is the most powerful form of peer review.
The fundamental challenge is to scientific self-correction. Journals can no longer contain the data, and neither scientists nor journals have taken the obvious step of having data relevant to a publication concurrently available in an electronic database. (An example is last year’s Nature paper revealing that only 11% of results in 50 benchmark papers in pre-clinical oncology were replicable.) If a lack of Oldenburg’s rigour in presenting evidence is widespread, a failure of replicability risks undermining science as a reliable way of acquiring knowledge and can therefore undermine its credibility.
Lots of interchangeable and fluid terms but many shared principles. The word “science” is used to mean the systematic organisation of knowledge that can be rationally explained and reliably applied. It is not exclusively restricted to “natural science”.