1st LEARN Workshop. Embedding Research Data as part of the research cycle. 29 Jan 2016. Presentation by Sabina Leonelli, Exeter Centre for the Study of Life Sciences (Egenis) & Department of Sociology, Philosophy and Anthropology, University of Exeter
1. The Challenges of Making Data Travel
Sabina Leonelli
Exeter Centre for the Study of Life Sciences (Egenis)
& Department of Sociology, Philosophy and
Anthropology
University of Exeter
@sabinaleonelli
www.datastudies.eu
2. Outline
• The Potential of Open Data
• Data Journeys:
– Challenges of collection
– Challenges of re-use
– Challenges of openness
– The Open Data divide
• Conclusions
3. Openness in Science
Long history of openness as a key norm for science: public scrutiny,
transparency and reproducibility of results define what science is,
how it works, what counts as a research output
Equally long history of reasons why it does not work in practice:
• Trust system where scrutiny is delegated to specialists
• Long paths from data generation to discovery
• Strong incentives provided by commercialisation and competition,
with associated intellectual property regimes around research
results (and conflicting interests of research sponsors and
institutions)
• Practical difficulties in disseminating and reproducing data,
software, techniques and materials, vis-à-vis research articles
• Publication regime itself increasingly commercialised
4. What makes Open Data valuable now?
• Potential to improve
– pathways to and quality of discoveries
– uptake of new technologies
– collaborative efforts across disciplines, nations and expertises
– research evaluation, debate and transparency
– appropriate valuation of research components beyond papers and patents
– fight against fraud, low quality and duplication of efforts
– legitimacy of science and public trust
– public understanding and participation
• Open Data as a platform to debate what counts as science, scientific
infrastructures and scientific governance, and how results should be
credited and disseminated
• Making data open means making data mobile and useful across sites,
contexts, uses: major challenges to realising that potential
• My concern: examining conditions under which the potential of data as
evidence for scientific claims can be realised sustainably in the long term
5. Researching Data Journeys
Investigating the conceptual/material/institutional labor involved in
making data travel from sites of production to sites of (re-)use
• Digital data infrastructures as sites for data movements and
integration across a wide variety of sources and perspectives
• Situations of data uptake and re-use in developed and developing
world (ongoing studies in UK, USA, Kenya, South Africa)
• Methods: history, philosophy and social studies of science
– Archival research
– Ethnographies and interviews on attitudes to openness, curation
practices and re-use
– Collaboration with researchers
• Policy involvement:
– Lead for Open Science working group of the Global Young Academy
(e.g. Access to Open Software Survey – Nigeria, Ghana, Bangladesh)
– Chair of ongoing Open Data consultation across European YAs
6. Research Data Management Across Disciplines
Scientific realms under investigation:
• model organism research: data on different aspects of same organism
• plant science: environmental, phenotypic and omics data
• biomedicine: clinical, crowdsourced, biological data
• oceanography: geological, geographical, meteorological, biological data
• archaeology, particle physics, climate science, economics
Parameters of comparison:
• Subject matter (complex objects versus simplified models)
• Data source (one or multiple disciplines)
• Data production mode (centralised vs dispersed; highly automated vs
system-specific)
• Data types (ease of dissemination and analysis, size, relation to software)
• Publication cultures and collaborative ethos
• Geographical locations, types and sources of funding involved
• Availability of relevant data (and other) infrastructures
• Ethical concerns and regulation
9. Challenges of Collection
Data sharing needs to be extensive, comprehensive, global
and long-term. This requires:
• Habitual data donation: challenge to current credit systems
and research practices, given considerable labor involved (NB:
when adopted as community ethos, huge boost to research)
• Adequate standards & guidelines for data formatting:
problematic given large diversity of methods & terminologies
• Well-organised databases: intelligent and labor-intensive
curation to avoid ‘data dumps’
• Sharing of related materials: reliable stock centres and
collections, rarely available & well-coordinated with databases
• Diversity of data types: now emphasis on cheap and easy
quantitative measurements
• Sustainability in time:
– commitment to data infrastructures beyond short term
– continuous updates of data standards and classification to
keep up with shifts in technology and knowledge
10. Challenges of Re-Use
• Qualitative results: very limited re-use*. Why?
• Misalignment between IT solutions and research
questions/needs/situations; problems with access to related
software
• Substantive disagreement over data management:
– methods, terminologies, standards involved in data production
and interpretation
– what counts as data in the first place (data as a relational
category)
• Re-use often linked to participation in developing data
infrastructures: rarely the case for busy practitioners, also
gap in skills
• Conflation of epistemic and economic value of data: wish
to capitalise on past investments risks encouraging
conservatism (building on old data instead of pursuing new
11. Challenges of Openness
• Semantic ambiguity: Openness means different things to different
people, even in same discipline (e.g. free of license, free of
ownership, under CC-BY license, common good, good enough to
share, unrestricted access and/or use, accessible without payment,
unclear/open to interpretation..) – explicit debate is key
• Problematic implementation: research ethos, career structures &
incentives lag behind; strong disincentives in competitive fields;
publication pressure leads to information control
• IP: confusion around which modes of intellectual property apply,
and to whom (individual researchers, labs, projects, networks,
universities, funders)
• Social & ethical concerns: data as tokens of personal identity
• Universities and the state: confusion around Open Data policies
and perceived tensions with metrics of excellence and
impact (e.g. UK)
12. The Open Data Divide
High-resource bias: richer labs struggle to comply, poorer labs are left
behind and/or choose not to participate
• databases mostly display outputs of top English-speaking labs, which
have funds to curate contents, visibility to determine dissemination
formats/procedures, resources and confidence to build on data
donated by others
• involvement of poor/unfashionable labs, scientists in middle-low-
income countries, non-scientists remains low & at ‘receiving’ end
• few provisions for situations of systematic disadvantage (e.g. lack of
infrastructures and online access, funding, governmental support,
expertise, materials; teaching demands; power cuts and transport
delays) and vulnerability (e.g. where access to a resource/location is
what gives competitive edge, as in archaeology, botany)
• low-resourced researchers are reluctant to contribute, fear it will
undermine rather than increase international credibility
13. Conclusions
1. OD is Neither Quick Nor Cheap
2. Open to What and When?
3. Link between OD and Access to Software
4. Estimating Prospective Value vs Preserving Open-Endedness
Meanings of openness in Oxford English Dictionary:
1. ‘free’ (of..)
2. ‘accessible, exposed, unrestricted’
3. ‘available, reusable’
4. ‘flexible, unpredictable, uncertain, unsettled’
Policy and scientific discourse centers around 1-3, and yet 4 is crucial
to science
14. Steps Forward: Researchers, Institutions,
Funders and Learned Societies
• Current data collections are very limited in scope and difficult to
re-use by outsiders
• Careful consideration needs to be given to what is disseminated,
why, how and with which priority and time-line
• Need to promote
– data curation as integral part of research, since being involved in
developing databases is key to effective data re-use
– critical discussions about what counts as data and openness in each
research community / centre / project, taking account of specific ethical,
legal and political concerns
• Crucial role of learned societies and funders in informing
researchers as well as policy-makers of shifting needs, resources
and constraints for each field
• Beware of the term “sharing”: it suggests, but does not entail,
reciprocity and common ground
15. With thanks to the Exeter Data Studies Group:
Brian Rappert
Louise Bezuidenhout
Ann Kelly
Niccolo Tempini
Gregor Halfmann
Rachel Ankeny
Main reference: Leonelli, Sabina (2016, in press) Data-Centric Biology: A
Philosophical Study. Chicago, IL: The University of Chicago Press.
For other relevant publications, see www.datastudies.eu, @DataScienceFeed
This research was funded by the European Research Council under the European
Union's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement
n° 335925; the UK Economic and Social Research Council (ESRC), grant number
ES/F028180/1; and the Leverhulme Trust, grant award RPG-2013-153.
Editor's notes
The enormous potential of Open Data within scientific research can be realised by understanding and supporting the specific conditions under which data can be effectively disseminated and re-used. Empirical research shows these conditions to be largely localised and field/application-specific, thus requiring decentralised policies and infrastructures. Attention also needs to be paid to conditions for inclusion in and exclusion from Open Data initiatives.
My work details the revolutionary impact of OS, and particularly Open Data, on the content of biological research, and even on what counts as research in the first place, and for whom.
However, there are major challenges to turning potential into reality.
My concern here is with how to ensure that OD potential is fully and sustainably realised, and challenges overcome.
Increasing commodification of scientific outputs beyond papers, e.g. data and protocols (the more emphasis on Open Data, the more recognition of multiple ways of valuing data, including economic). Openness underscores and challenges existing property and privacy regimes at the same time.
RNA-seq (whole transcriptome shotgun sequencing)
* This is very hard to quantify, requires in-depth case analysis and interviews (consultation of databases does not count, and citations of datasets are not yet well-established enough)
How do leading researchers in biology understand the idea of ‘openness’? How does it relate, if at all, to their practices and preferences? What value does this idea, and its specific manifestations, hold within cutting-edge research?
How do guidelines on Open Data impact research practices in the biosciences?
How do they fit existing practices of sharing, collaboration and mentoring?
What are the incentives, costs, opportunities and problems encountered in implementing them?
How do they fit in with existing policies and norms on intellectual property, innovation and commercialisation, and impact?
Timings matter: when to implement openness is as important as how and what
Importance of temporarily restricting access in order to focus resources and expertise towards data analysis
When is information most fruitfully released at different stages of inquiry? Same issue arises with licensing: timing is as crucial as deciding what to license and how (too early kills R&D, too late and you’re out of competition)
These assessments are unavoidably context-dependent; umbrella policies are key incentives but can only go so far.