1. Reproducibility Challenges
in Computational Settings:
What are they,
why should we address them, and how?
Andreas Rauber
Vienna University of Technology
rauber@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at/~andi
2. Outline
What are the challenges in reproducibility?
What do we gain from reproducibility?
(and: why is non-reproducibility interesting?)
How to address the challenges of complex processes?
How to deal with “Big Data”?
Summary
3. Challenges in Reproducibility
Challenges in reproducibility
or: Why data sharing is not enough
FAIR principles are a necessity
Data Management and DMPs are a necessity
But they are not sufficient if we want to
- Ensure reproducibility
- Enable metastudies
- Benefit from efficient eScience
…unless we define data more broadly than we commonly do
10. Challenges in Reproducibility
Large-scale quantitative analysis
Obtain workflows from myExperiment.org
- March 2015: almost 2,700 WFs (approx. 300-400/year)
- Focus on Taverna 2 WFs: 1,443 WFs
- Workflows published by their authors should be of "better quality"
Try to re-execute the workflows
- Record the reasons for failure along the way
Analyse the most common reasons for failure
11. Re-Execution results
The majority of workflows fail
Only 23.6% execute successfully
- No analysis yet of the correctness of results…
Rudolf Mayer, Andreas Rauber, “A Quantitative Study on
the Re-executability of Publicly Shared Scientific
Workflows”, 11th IEEE Intl. Conference on e-Science,
2015.
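The study's bookkeeping can be sketched as a simple tally over recorded re-execution outcomes. The failure categories and counts below are illustrative stand-ins, not the paper's actual data or taxonomy:

```python
from collections import Counter

# Hypothetical outcome log: one record per attempted workflow re-execution.
# Category names here are illustrative, not the paper's exact taxonomy.
outcomes = [
    "success", "missing_external_service", "missing_external_service",
    "unresolved_dependency", "success", "platform_error",
    "missing_input_data", "unresolved_dependency",
]

tally = Counter(outcomes)
total = len(outcomes)
success_rate = 100.0 * tally["success"] / total

# Most common failure reasons first, with their share of all attempts
for reason, count in tally.most_common():
    print(f"{reason}: {count} ({100.0 * count / total:.1f}%)")
print(f"success rate: {success_rate:.1f}%")
```

Recording a machine-readable reason per failed run is what makes the later "most common reasons" analysis possible at all.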
12. Computer Science
613 papers in 8 ACM conferences
Process
- download paper and classify
- search for a link to the code (in the paper, on the web, via email to the authors, twice)
- download code
- build and execute
Christian Collberg and Todd Proebsting. "Repeatability in Computer Systems Research." CACM 59(3):62-69, 2016.
13. In a nutshell – and another aspect of reproducibility:
Challenges in Reproducibility
Source: xkcd
14. Reproducibility – solved! (?)
Reproducibility is more than just sharing the data!
Provide source code, parameters, data, …
Ensure that it works:
Wrap it up in a container/virtual machine, …
…
done?
15. Outline
What are the challenges in reproducibility?
What do we gain by aiming for reproducibility?
How to address the challenges of complex processes?
How to deal with dynamic data?
Summary
16. Reproducibility – solved! (?)
Provide source code, parameters, data, …
Wrap it up in a container/virtual machine, …
…
Why do we want reproducibility?
Which levels of reproducibility are there?
What do we gain by different levels of reproducibility?
17. Reproducibility – solved! (?)
Dagstuhl Seminar:
Reproducibility of Data-Oriented Experiments in e-Science
January 2016, Dagstuhl, Germany
18. Types of Reproducibility
The PRIMAD¹ model: which attributes can we "prime"?
- Data
  • Parameters
  • Input data
- Platform
- Implementation
- Method
- Research Objective
- Actors
What do we gain by priming one or the other?
[1] Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of Data-Oriented
Experiments in eScience. Dagstuhl Reports, 6(1), 2016.
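One way to make the model concrete is a small record of which attributes a given reproduction attempt varies. The encoding below is my own illustration, not part of the PRIMAD paper:

```python
# PRIMAD attributes of an experiment (names adapted from the model;
# this encoding is illustrative, not the authors' formalization).
PRIMAD_ATTRIBUTES = [
    "research_objective", "method", "implementation",
    "platform", "parameters", "input_data", "actors",
]

def primed(**changed):
    """Return the PRIMAD attributes varied ("primed") in a reproduction
    attempt; everything not listed is held fixed."""
    unknown = set(changed) - set(PRIMAD_ATTRIBUTES)
    if unknown:
        raise ValueError(f"not a PRIMAD attribute: {unknown}")
    return sorted(a for a in PRIMAD_ATTRIBUTES if changed.get(a))

# e.g. a re-implementation run on a different platform,
# with the same method, data and parameters:
print(primed(implementation=True, platform=True))
```

Each combination of primed attributes corresponds to a different level of reproducibility, and thus to a different gain: priming the platform tests portability, priming the input data tests generality, and so on.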
20. Reproducibility Papers
Aim for reproducibility: for one's own sake – and as conference track chair, editor, reviewer, supervisor, …
- Review of reproducibility of submitted work (material provided)
- Encouraging reproducibility studies
- (Messages to stakeholders in Dagstuhl Report)
Consistency of results, not identity!
Reproducibility studies and papers
- Not just re-running code / a virtual machine
- When is a reproducibility paper worth the effort /
worth being published?
22. Learning from Non-Reproducibility
Do we always want reproducibility?
- Scientifically speaking: yes!
Research is addressing challenges:
- Looking for and learning from non-reproducibility!
Non-reproducibility arises if
- some (unknown) aspect of a study influences the results
- Technical: parameter sweep, bug in code, OS, … -> fix it!
- Non-technical: input data! (specifically: "the user")
23. Learning from Non-Reproducibility
Challenges in MIR (Music Information Retrieval) – "things sometimes don't seem to work"
Virtual Box, Github, <your favourite tool> are starting points
Same features, same algorithm, different data ->
Same data, different listeners ->
Understanding “the rest”:
- Isolating unknown influence factors
- Generating hypotheses
- Verifying these to understand the “entire system”,
cultural and other biases, …
Benchmarks and Meta-Studies
24. Reproducibility – solved! (?)
Provide source code, parameters, data, …
Wrap it up in a container/virtual machine,
Provide context information
Encourage reproducibility studies beyond re-running
Use it to establish trust in your research & gain new insights
done?
25. Outline
What are the challenges in reproducibility?
What do we gain by aiming for reproducibility?
How to address the challenges of complex processes?
How to deal with “Big Data”?
Summary
27. And the solution is…
Standardization and Documentation
- Standardized components, procedures, workflows
- Documenting the complete system set-up across the entire provenance chain
How to do this – efficiently?
Alexander Graham Bell’s Notebook, March 9 1876
https://commons.wikimedia.org/wiki/File:Alexander_Graham_Bell's_notebook,_March_9,_1876.PNG
Pieter Bruegel the Elder: De Alchemist (British Museum, London)
28. Documenting a Process
Context Model: establish what to document and how
Meta-model for describing process & context
- Extensible architecture integrated by core model
- Reusing existing models as much as possible
- Based on ArchiMate, implemented using OWL
Extracted by static and dynamic analysis
29. Context Model – Static Analysis
Analyses steps, platforms, services, tools called
Dependencies (packages, libraries)
HW, SW Licenses, …
Taverna Workflow
ArchiMate model
Context Model
(OWL ontology)
#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex
Script
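A toy version of the static analysis — scanning such a script for the tools it invokes and the files it touches, which the real extractor would then map into the ArchiMate/OWL context model — might look like:

```python
import re

# The example script from the slide (verbatim).
script = """#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex
"""

tools, files = set(), set()
for line in script.splitlines():
    line = line.strip()
    if not line or line.startswith("#"):   # skip comments, blanks, shebang
        continue
    tokens = line.split()
    tools.add(tokens[0])                    # first token = invoked tool
    # crude file detection: tokens that look like name.extension
    files.update(t for t in tokens if re.fullmatch(r"[\w.]+\.\w+", t))

print("tools:", sorted(tools))
print("files:", sorted(files))
```

A real extractor would additionally resolve each tool to its package, version and license to populate the dependency part of the context model; this sketch only shows the first harvesting step.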
30. Context Model – Dynamic Analysis
Process Migration Framework (PMF)
- designed for automatic redeployments into virtual machines
- uses strace to monitor system calls
- complete log of all accessed resources (files, ports)
- captures and stores process instance data
- analyse resources (file formats via PRONOM, PREMIS)
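A minimal sketch of the log-analysis step, assuming strace-style `openat(...)` lines as captured by something like `strace -f -e trace=open,openat -o trace.log ./run.sh` (the paths below are made up): it separates successfully accessed resources from failed lookups.

```python
import re

# A few lines as they might appear in a captured strace log
# (illustrative paths, not real study data).
log = """\
openat(AT_FDCWD, "/usr/lib/jvm/java/lib/rt.jar", O_RDONLY) = 3
openat(AT_FDCWD, "IQData.zip", O_RDONLY) = 4
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "missing.cfg", O_RDONLY) = -1 ENOENT (No such file or directory)
"""

accessed, failed = set(), set()
for line in log.splitlines():
    # match open/openat calls: capture the quoted path and the return value
    m = re.search(r'open(?:at)?\([^"]*"([^"]+)"[^)]*\)\s*=\s*(-?\d+)', line)
    if not m:
        continue
    path, ret = m.group(1), int(m.group(2))
    (accessed if ret >= 0 else failed).add(path)

print("accessed:", sorted(accessed))
print("failed:", sorted(failed))
```

The accessed set is what gets stored as the process instance data and checked against format registries; the failed set is an early warning that a redeployment will miss resources.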
37. #!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex
VFramework (annotations: ADDED / NOT USED)
38. Reproducibility – solved! (?)
Provide source code, parameters, data, …
Wrap it up in a container/virtual machine,
Provide context information
Encourage reproducibility studies beyond re-running
Use it to establish trust in your research & gain new insights
(automatically) capture process execution context
Verify re-executions
done?
39. Outline
What are the challenges in reproducibility?
What do we gain by aiming for reproducibility?
How to address the challenges of complex processes?
How to deal with “Big Data”?
Summary
40. Research Data Alliance
WG on Data Citation:
Making Dynamic Data Citeable
WG endorsed in March 2014
- Concentrating on the problems of
large, dynamic (changing) datasets
- Focus: identification of data!
Not: PID systems, metadata, citation string, attribution, …
- Liaise with other WGs and initiatives on data citation
(CODATA, DataCite, Force11, …)
- https://rd-alliance.org/working-groups/data-citation-wg.html
RDA WG Data Citation
41. Data Citation – Output
14 Recommendations
grouped into 4 phases:
- Preparing data and query store
- Persistently identifying specific data sets
- Resolving PIDs
- Upon modifications to the data infrastructure
2-page flyer
https://rd-alliance.org/recommendations-working-group-data-citation-revision-oct-20-2015.html
More detailed report: IEEE TCDL 2016
http://www.ieee-tcdl.org/Bulletin/v12n1/papers/IEEE-TCDL-DC-2016_paper_1.pdf
Detailed presentation on
Tuesday, Session 9,
12:00-13:30
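The gist of the recommendations — a timestamped, hash-verified query store over versioned data — can be sketched as follows. All names, the schema and the toy PID scheme are my own illustration, not the WG's specification:

```python
import hashlib
from datetime import datetime, timezone

query_store = {}  # PID -> citation record

def cite_query(query: str, result_rows: list) -> str:
    """Store a normalized query with its execution timestamp and a hash of
    the result set; return a (toy) PID identifying exactly this subset."""
    normalized = " ".join(query.split()).lower()
    result_hash = hashlib.sha256(
        "\n".join(map(str, sorted(result_rows))).encode()
    ).hexdigest()
    timestamp = datetime.now(timezone.utc).isoformat()
    pid = "pid:" + hashlib.sha256((normalized + timestamp).encode()).hexdigest()[:12]
    query_store[pid] = {
        "query": normalized,
        "executed": timestamp,
        "result_hash": result_hash,
    }
    return pid

def verify(pid: str, result_rows: list) -> bool:
    """Re-execute the stored query later (against the versioned data as of
    `executed`) and check that the result hash still matches."""
    rec = query_store[pid]
    h = hashlib.sha256("\n".join(map(str, sorted(result_rows))).encode()).hexdigest()
    return h == rec["result_hash"]

pid = cite_query("SELECT * FROM sensors WHERE station = 'A'", [(1, 3.5), (2, 3.7)])
print(pid, verify(pid, [(1, 3.5), (2, 3.7)]))
```

Because only the query, timestamp and hash are stored — not a copy of the subset — the approach scales to large, continuously changing datasets while still making each cited subset precisely re-identifiable and verifiable.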
43. 3 Take-Away Messages
Message 1
Aim at achieving reproducibility at different levels
- Re-run, ask others to re-run
- Re-implement
- Port to different platforms
- Test on different data,
vary parameters (and report!)
If something is not reproducible -> investigate!
(you might be onto something!)
Encourage reproducibility studies!
44. 3 Take-Away Messages
Message 2
Aim for better procedures and documentation
Document the research process, environment,
interim results, …
(preferably automatically, 80:20, …)
The process is part of the data (and vice versa)
Source: xkcd
Pieter Bruegel the Elder: De
Alchemist (British Museum, London)
Research Objects, Context
Models, VFramework
45. 3 Take-Away Messages
Message 3
Aim for proper (research) data management
(not just in academia!)
Data Management Plans, Research Infrastructure Services
Source: http://www.phdcomics.com/comics.php?f=1323
RDA WGDC: Dynamic Data Citation
Detailed presentation on
Tuesday, Session 9,
12:00-13:30
46. Summary
Trustworthy and efficient e-Science
Need to move beyond preserving code + data
Need to move beyond the focus on description
Capture the process and its entire execution context
Precisely identify the data used in the process
Verification of re-execution
Data and process re-use as basis for data driven science
- evidence
- investment
- efficiency
Trust!!
47. Summary
Preaching and eating…
Do we do all this in our lab for our experiments?
No! (not yet?)
Researchers (also in CS) need assistance
Institutions and Research Infrastructures
… and some research on open questions on how to best do
all of this (but mind the infamous 80:20 rule)
48. Summary
C. Glenn Begley, Alastair M. Buchan, Ulrich Dirnagl: Robust research: Institutions must do their part for reproducibility. Nature 525(7567), Sep 3, 2015. Illustration by David Parkins.
http://www.nature.com/news/robust-research-institutions-must-do-their-part-for-reproducibility-1.18259?WT.mc_id=SFB_NNEWS_1508_RHBox
49. Acknowledgements
Johannes Binder
Rudolf Mayer
Tomasz Miksa
Stefan Pröll
Stephan Strodl
Marco Unterberger
TIMBUS
SBA: Secure Business Austria
RDA: Research Data Alliance WGDC
50. References
Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of Data-Oriented
Experiments in eScience. Dagstuhl Reports, 6(1), 2016.
Andreas Rauber, Ari Asmi, Dieter van Uytvanck and Stefan Proell. Identification of
Reproducible Subsets for Data Citation, Sharing and Re-Use. Bulletin of IEEE Technical
Committee on Digital Libraries (TCDL), vol. 12, 2016.
Andreas Rauber, Tomasz Miksa, Rudolf Mayer and Stefan Proell. Repeatability and Re-
Usability in Scientific Processes: Process Context, Data Identification and Verification. In
Proceedings of the 17th International Conference on Data Analytics and Management in
Data Intensive Domains (DAMDID), 2015.
Tomasz Miksa, Rudolf Mayer and Andreas Rauber. Ensuring sustainability of web services dependent processes. International Journal of Computational Science and Engineering (IJCSE), Vol. 10, No. 1/2, pp. 70-81, 2015.
Rudolf Mayer and Andreas Rauber, A Quantitative Study on the Re-executability of
Publicly Shared Scientific Workflows. 11th IEEE Intl. Conference on e-Science, 2015.
Rudolf Mayer, Tomasz Miksa and Andreas Rauber. Ontologies for describing the context of
scientific experiment processes. 10th IEEE Intl. Conference on e-Science, 2014.
Tomasz Miksa, Stefan Proell, Rudolf Mayer, Stephan Strodl, Ricardo Vieira, Jose Barateiro
and Andreas Rauber, Framework for verification of preserved and redeployed processes.
10th International Conference on Preservation of Digital Objects (IPRES2013), 2013.
Tomasz Miksa, Stephan Strodl and Andreas Rauber, Process Management Plans. International Journal of Digital Curation, Vol. 9, No. 1 (2014), pp. 83-97.
- test instance selection
- local dependencies
- external dependencies
- provenance data
- state of the context model [integrated + scheme] (discussion on the minimal context model)