Open Data in a Big Data World: easy to say, but hard to do?

Presentation at 3rd LEARN workshop on Research Data Management, “Make research data management policies work”
Helsinki, 28 June 2016, by Sarah Callaghan, STFC Rutherford Appleton Laboratory

  1. Open Data in a Big Data World: easy to say, but hard to do? Sarah Callaghan (sarah.callaghan@stfc.ac.uk, @sorcha_ni, ORCID: 0000-0002-0517-1031); Geoffrey Boulton, Dominique Babini, Simon Hodson, Jianhui Li, Tshilidzi Marwala, Maria Musoke, Paul Uhlir, Sally Wyatt. 3rd LEARN workshop on Research Data Management, “Make research data management policies work”, Helsinki, 28 June 2016
  2. Principles, Policies & Practice
     Responsibilities: 1-2. Scientists; 3. Research institutions & universities; 4. Publishers; 5. Funding agencies; 6. Scholarly societies and academies; 7. Libraries & repositories
     8. Boundaries of openness
     Enabling practices: 9. Citation and provenance; 10. Interoperability; 11. Non-restrictive re-use; 12. Linkability
     http://www.icsu.org/science-international/accord
  3. The Data Deluge. http://www.economist.com/node/21521549 http://www.leadformix.com/blog/2013/02/the-big-data-deluge/
  4. It used to be “easy”… Suber cells and mimosa leaves: Robert Hooke, Micrographia, 1665. The Scientific Papers of William Parsons, Third Earl of Rosse, 1800-1867. …but datasets have gotten so big, it’s not useful to publish them in hard copy anymore
  5. Hard copy of the Human Genome at the Wellcome Collection
  6. Example Big Data: CMIP5
     CMIP5: Fifth Coupled Model Intercomparison Project
     • Global community activity under the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP)
     • Aim:
       – to address outstanding scientific questions that arose as part of the 4th Assessment Report process,
       – to improve understanding of climate, and
       – to provide estimates of future climate change that will be useful to those considering its possible consequences.
     Many distinct experiments, with very different characteristics, which influence the configuration of the models (what they can do, and how they should be interpreted).
  7. CMIP5 numbers
     Simulations:
     ~ 90,000 years
     ~ 60 experiments
     ~ 20 modelling centres (from around the world) using
     ~ 30 major(*) model configurations
     ~ 2 million output “atomic” datasets
     ~ 10's of petabytes of output
     ~ 2 petabytes of CMIP5 requested output
     ~ 1 petabyte of CMIP5 “replicated” output
     Which are replicated at a number of sites (including ours)
     Major international collaboration! Funded by EU FP7 projects (IS-ENES2, Metafor) and US (ESG) and other national sources (e.g. NERC for the UK)
  8. Summary of the CMIP5 example
     The Climate problem needs:
       – Major physical e-infrastructure (networks, supercomputers)
       – Comprehensive information architectures covering the whole information life cycle, including annotation (particularly of quality) … and hard work populating these information objects, particularly with provenance detail.
       – Sophisticated tools to produce and consume the data and information objects
       – State of the art access control techniques
     Major distributed systems are social challenges as much as technical challenges.
     CMIP5 is Big Data, with lots of different participants and lots of different technologies. It also has a community willing to work together to standardise and automate data and metadata production and curation, and with the willingness to support the effort needed for openness.
  9. Big Data:
     • Industrialised and standardised data and metadata production
     • Large groups of people involved
     • Methods for making the data open, attribution and credit for data creation established
     Long Tail Data:
     • Bespoke data and metadata creation methods
     • Small groups/lone researchers
     • No generally accepted methods for attribution and credit for data creation. Often data is closed due to lack of effort to open it
     https://flic.kr/p/g1EHPR
  10. Most people have an idea of what a publication is
  11. Some examples of data (just from the Earth Sciences)
      1. Time series, some still being updated, e.g. meteorological measurements
      2. Large 4D synthesised datasets, e.g. Climate, Oceanographic, Hydrological and Numerical Weather Prediction model data generated on a supercomputer
      3. 2D scans, e.g. satellite data, weather radar data
      4. 2D snapshots, e.g. cloud camera
      5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature
      6. Datasets consisting of data from multiple instruments as part of the same measurement campaign
      7. Physical samples, e.g. fossils
  12. Open Data is not a new idea. Henry Oldenburg
  13. Data, Reproducibility and Science
      Science should be reproducible – other people doing the same experiments in the same way should get the same results.
      Observational data is not reproducible (unless you have a time machine).
      Therefore we need to have access to the data to confirm the science is valid!
      Poor data analysis generates false facts – and false facts & inaccessible data undermine science & its credibility
      http://www.flickr.com/photos/31333486@N00/1893012324/sizes/o/in/photostream/
  14. A crisis of reproducibility and credibility?
      The data providing the evidence for a published concept MUST be concurrently published, together with the metadata. To do otherwise is scientific MALPRACTICE.
      Pre-clinical oncology – 89% not reproducible
      Why?
      • Misconduct/fraud
      • Invalid reasoning
      • Absent or inadequate data and/or metadata
  15. We’re only going to get more data
      More big data – linked data – machine learning
      The internet of things
      So, what must we do?
      • Concurrently publish data and metadata that are the evidence for a published scientific claim – to do otherwise is malpractice
      • Data science skills for researchers
      • Re-establish standards of reproducibility for a data-intensive age
  16. Scientific Opportunities of Big Data
      • Patterns not hitherto seen
      • Unsuspected relationships
      • Integrated analysis of diverse data (e.g. natural & social science)
      • Complex systems, e.g. complexity: dynamic evolution and system state
      But not all research is or needs to be data-intensive
      https://www.clickz.com/clickz/column/2389218/create-better-content-via-humor
  17. Caveat Emptor! http://www.tylervigen.com/spurious-correlations
  18. The Open Data Edifice
      Data supporting a published claim; other data for re-use & integration
      Pillars of the Digital Revolution:
      • Big Data: Volume, Velocity, Variety, Veracity
      • Linked Data: many databases, semantic relations, deeper meaning
      Foundations: Openness
      Machine analysis & learning
  19. It is happening: bottom-up Open Data initiatives
      Open Data initiatives in areas of: Life sciences; Earth Science, Environmental Science; Food Science; Agricultural Science; Chemical Crystallography; Bioinformatics/Genomics; Linguistics; Social Sciences; Evolutionary biology; Biodiversity; Astronomy; Earth Observation (GEO); Archaeology; Atmospheric sciences
      EMBL-EBI services: Labs around the world send us their data and we…
      • Archive it
      • Classify it
      • Share it with other data providers
      • Analyse, add value and integrate it
      • …provide tools to help researchers use it
      A collaborative enterprise. Elixir programme
  20. The Open Data Iceberg
      The Technical Challenge; The Consent Challenge; The Institutional Challenge; The Funding Challenge; The Support Challenge; The Skills Challenge; The Incentives Challenge; The Mindset Challenge
      Diagram labels: Processes & Organisation; People; A National Infrastructure; Technology
      Developed from: Deetjen, U., E. T. Meyer and R. Schroeder (2015). OECD Digital Economy Papers, No. 246, OECD
  21. From the Accord: Responsibilities
      Scientists
      i. Publicly funded scientists have a responsibility to contribute to the public good through the creation and communication of new knowledge, of which associated data are intrinsic parts. They should make such data openly available to others as soon as possible after their production in ways that permit them to be re-used and re-purposed.
      ii. The data that provide evidence for published scientific claims should be made concurrently and publicly available in an intelligently open form. This should permit the logic of the link between data and claim to be rigorously scrutinised and the validity of the data to be tested by replication of experiments or observations. To the extent possible, data should be deposited in well-managed and trusted repositories with low access barriers.
  22. Creating a dataset is hard work! Documenting a dataset so that it is usable and understandable by others is extra work! ("Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com)
  23. “I’m all for the free sharing of information, provided it’s them sharing their information with us.” Mustrum Ridcully, D.Thau., D.M., D.S., D.Mn., D.G., D.D., D.C.L., D.M. Phil., D.M.S., D.C.M., D.W., B.El.L, Archchancellor, Unseen University, Ankh-Morpork, Discworld. As quoted in “Unseen Academicals”, by Terry Pratchett. http://discworld.wikia.com/wiki/Mustrum_Ridcully
  24. Open is not enough!
      “When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.”
      - http://ivory.idyll.org/blog/data-management.html
      https://flic.kr/p/awnCQu
  25. Incentives for Open Data
      • Need reward structures and incentives for researchers to encourage them to make their data open
      • Data citation and publication
      • (again, issues with treating data as a special case of publications…)
  26. The Understandability Challenge: Article
  27. The Understandability Challenge: Data
      What the data set looks like on disk. What the raw data files look like. I could make these files open easily, but no one would have a clue how to use them!
  28. It’s ok, I’ll just put it out there and if it’s important other people will figure it out
      These documents have been preserved for thousands of years! But they’ve both been translated many times, with different meanings each time.
      We need Metadata to preserve Information. We can’t rely on Data Archaeology.
      Phaistos Disk, 1700BC
  29. http://theupturnedmicroscope.com/comic/negative-data/
  30. It’s not just data!
      • Experimental protocols
      • Workflows
      • Software code
      • Metadata
      • Things that went wrong!
      • …
  31. Usability, trust, metadata
      When you read a journal paper, it’s easy to read and get a quick understanding of the quality of the paper. You don’t want to be downloading many GB of dataset to open it and see if it’s any use to you.
      Need to use proxies for quality:
      • Do you know the data source/repository? Can you trust it?
      • Is there enough metadata so that you can understand and/or use the data?
      In the same way that not all journal publishers are created equal, not all data repositories are created equal.
      Example metadata from a published dataset: “rain.csv contains rainfall in mm for each month at Marysville, Victoria from January 1995 to February 2009”
      Lindenmayer, David B.; Wood, Jeff; McBurney, Lachlan; Michael, Damian; Crane, Mason; MacGregor, Christopher; Montague-Drake, Rebecca; Gibbons, Philip; Banks, Sam C. (2011): rain; Dryad Digital Repository. http://doi.org/10.5061/DRYAD.QP1F6H0S/3
      http://trollcats.com/2009/11/im-your-friend-and-i-only-want-whats-best-for-you-trollcat/
  32. Should ALL data be open?
      Most data produced through publicly funded research should be open. But!
      • Confidentiality issues (e.g. named persons’ health records)
      • Conservation issues (e.g. maps of locations of rare animals at risk from poachers)
      • Security issues (e.g. data and methodologies for building biological weapons)
      There should be a very good reason for publicly funded data to not be open.
  33. Getting scooped
      http://www.phdcomics.com/comics/archive.php?comicid=795
      It happened to me! I shared my data with another research group. They published the first results using that data. I wasn’t a co-author. I didn’t get an acknowledgement.
  34. Citeable does not equal Open!
      Just like you can cite a paper that is behind a paywall, you can cite a dataset that isn’t open.
      Making something citeable means that:
      • You know it exists
      • You know who’s responsible for it
      • You know where to find it
      • You know a little bit about it (title, abstract,…)
      Even if you can’t download/read the thing yourself.
      Citation gives benefits that encourage data producers to make their data open
  35. Be careful of your citations!
  36. A direction of travel?
      Inputs; Outputs
      Open access; Open data; Open science
      • Administrative data (held by public authorities, e.g. prescription data)
      • Public Sector Research data (e.g. Met Office weather data)
      • Research Data (e.g. CERN, generated in universities)
      • Research publications (i.e. papers in journals)
      Collecting the data; Doing research; Doing science openly
      Stakeholders: Researchers - Govt & Public sector - Businesses - Citizens - Citizen scientists (communication/dialogue – joint production of knowledge)
      • Communication/dialogue must be audience-sensitive
      • Is it – with all stakeholder groups?
  37. Summary and maybe conclusions?
      • We need to open the products of research
        • to encourage innovation and collaboration
        • to give credit to the people who’ve created them
        • to be transparent and trustworthy
      • Openness does come at a cost!
      • It’s not enough for data to be open
        • it needs to be usable and understandable too
      • Data citation and publication are ways of encouraging researchers to make their data open
        • or at least tell the world that their data exists!
      • We need a culture change – but it’s already happening!
      http://www.keepcalm-o-matic.co.uk/default.asp
  38. Thanks! Any questions?
      sarah.callaghan@stfc.ac.uk @sorcha_ni http://citingbytes.blogspot.co.uk/
      “Publishing research without data is simply advertising, not science” - Graham Steel
      http://blog.okfn.org/2013/09/03/publishing-research-without-data-is-simply-advertising-not-science/
      http://heywhipple.com/dont-show-me-a-something-about-show-me-something/
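
Slides 31 and 34 argue that a dataset becomes usable and citeable once a small amount of metadata travels with it: who is responsible, where to find it, and a little about what it contains. The sketch below is one possible way to express that check in Python, using the Dryad record quoted on slide 31 as sample data. The `DatasetRecord` class, the `REQUIRED_FIELDS` list, and the citation layout are illustrative assumptions, not a schema or citation style prescribed by the talk or by any repository.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Fields checked before a record counts as citeable. This list is an
# illustrative assumption, not a standard taken from the slides.
REQUIRED_FIELDS = ("creators", "year", "title", "publisher", "identifier", "description")


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record for one published dataset."""
    creators: List[str] = field(default_factory=list)
    year: Optional[int] = None
    title: str = ""
    publisher: str = ""
    identifier: str = ""
    description: str = ""


def missing_fields(record: DatasetRecord) -> List[str]:
    """Return the names of required fields that are empty or unset."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


def format_citation(record: DatasetRecord) -> str:
    """Build a one-line citation; refuse if required metadata is missing."""
    gaps = missing_fields(record)
    if gaps:
        raise ValueError("cannot cite, missing metadata: " + ", ".join(gaps))
    authors = "; ".join(record.creators)
    # Assumed layout: Creators (Year): Title. Publisher. Identifier
    return f"{authors} ({record.year}): {record.title}. {record.publisher}. {record.identifier}"


# Sample data copied from the Dryad example on slide 31.
rain = DatasetRecord(
    creators=[
        "Lindenmayer, David B.", "Wood, Jeff", "McBurney, Lachlan",
        "Michael, Damian", "Crane, Mason", "MacGregor, Christopher",
        "Montague-Drake, Rebecca", "Gibbons, Philip", "Banks, Sam C.",
    ],
    year=2011,
    title="rain",
    publisher="Dryad Digital Repository",
    identifier="http://doi.org/10.5061/DRYAD.QP1F6H0S/3",
    description=("rain.csv contains rainfall in mm for each month at "
                 "Marysville, Victoria from January 1995 to February 2009"),
)

print(missing_fields(rain))
print(format_citation(rain))
```

A record with gaps, such as `DatasetRecord(title="x")`, fails the check and cannot be cited, which mirrors the point of slide 24: files named by UUID with no metadata are technically open but of little use to anyone else.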

Editor's notes

  • This is Henry Oldenburg, the first secretary of the newly formed Royal Society in the early 1660s. Henry was an inveterate correspondent with those we would now call scientists, both in Europe and beyond. Rather than keep this correspondence private, he thought it would be a good idea to publish it, and persuaded the new Society to do so by creating the Philosophical Transactions, which remains a top-flight journal to the present day. But he demanded two things of his correspondents: that they should submit in the vernacular and not Latin; and that evidence (data) that supported a concept must be published together with the concept. This permitted others to scrutinize the logic of the concept and the extent to which it was supported by the data, and it permitted replication and re-use. Open publication of concept and evidence is the basis of “scientific self-correction”, which historians of science argue was the crucial building block on which the scientific revolution of the 18th and 19th centuries was built and which remains fundamental to the progress of science. Openness to scrutiny by scientific peers is the most powerful form of peer review.
  • The fundamental challenge is to scientific self-correction. Journals can no longer contain the data, and neither scientists nor journals have taken the obvious step of having data relevant to a publication concurrently available in an electronic database (for example, last year’s Nature paper revealing that only 11% of results in 50 benchmark papers in pre-clinical oncology were replicable). If lack of Oldenburg’s rigour in presenting evidence is widespread, a failure of replicability risks undermining science as a reliable way of acquiring knowledge and can therefore undermine its credibility.
  • Lots of interchangeable and fluid terms but many shared principles.
    The word “science” is used to mean the systematic organisation of knowledge that can be rationally explained and reliably applied. It is not exclusively restricted to “natural science”.
