Reproducibility Challenges
in Computational Settings:
What are they,
why should we address them, and how?
Andreas Rauber
Vienna University of Technology
rauber@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at/~andi
Outline
 What are the challenges in reproducibility?
 What do we gain from reproducibility?
(and: why is non-reproducibility interesting?)
 How to address the challenges of complex processes?
 How to deal with “Big Data”?
 Summary
Challenges in Reproducibility
 Challenges in reproducibility
 or: Why data sharing is not enough
 FAIR principles are a necessity
 Data Management and DMPs are a necessity
 But they are not sufficient if we want to
- Ensure reproducibility
- Enable metastudies
- Benefit from efficient eScience
 …unless we define data broader than we commonly
tend to do
Challenges in Reproducibility
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
Challenges in Reproducibility
 Excursion: scientific processes
[Figure: set1_freq440Hz_Am12.0Hz, set1_freq440Hz_Am05.5Hz, set1_freq440Hz_Am11.0Hz; panels: Java, Matlab]
Challenges in Reproducibility
 Excursion: Scientific Processes
 Bug?
 Psychoacoustic transformation tables?
 Forgetting a transformation?
 Different implementation of filters?
 Limited accuracy of calculation?
 Difference in FFT implementation?
 ...?
Challenges in Reproducibility
 Workflows
Taverna
Challenges in Reproducibility
 Large scale quantitative analysis
 Obtain workflows from MyExperiments.org
- March 2015: almost 2,700 WFs (approx. 300-400/year)
- Focus on Taverna 2 WFs: 1,443 WFs
- Published by their authors -> should be of “better quality”
 Try to re-execute the workflows
- Record data on the reasons for failure along the way
 Analyse the most common reasons for failures
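The procedure above (re-execute, record the reason for each failure, then analyse the most common reasons) can be sketched in a few lines. This is only an illustration under stated assumptions: the three "workflows" are hypothetical stand-ins implemented as tiny Python commands, and the failure categories are invented for the example rather than taken from the study.

```python
import subprocess
import sys
from collections import Counter


def classify(cmd, timeout=30):
    """Run one command and return a coarse outcome label."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return "missing executable"
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "success"
    if "ModuleNotFoundError" in proc.stderr:
        return "missing dependency"
    return "other failure"


# Hypothetical stand-ins for real workflows.
workflows = {
    "wf_ok": [sys.executable, "-c", "print('done')"],
    "wf_missing_dep": [sys.executable, "-c", "import no_such_module_xyz"],
    "wf_crash": [sys.executable, "-c", "raise SystemExit(3)"],
}

# Re-execute everything and record the outcome per workflow.
outcomes = {name: classify(cmd) for name, cmd in workflows.items()}

# Tally the outcomes to find the most common reasons for failure.
tally = Counter(outcomes.values())
success_rate = tally["success"] / len(workflows)

print(outcomes)
print(tally.most_common())
print(f"successfully executed: {success_rate:.1%}")
```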
Re-Execution results
The majority of workflows fail
Only 23.6% are
successfully executed
- No analysis yet on
correctness of results…
Challenges in Reproducibility
Rudolf Mayer, Andreas Rauber, “A Quantitative Study on
the Re-executability of Publicly Shared Scientific
Workflows”, 11th IEEE Intl. Conference on e-Science,
2015.
Computer Science
 613 papers in 8 ACM conferences
 Process
- download paper and classify
- search for a link to code (paper, web, email twice)
- download code
- build and execute
Christian Collberg and Todd Proebsting. “Repeatability in
Computer Systems Research,” Communications of the ACM 59(3):62-69, 2016.
In a nutshell – and another aspect of reproducibility:
Challenges in Reproducibility
Source: xkcd
Reproducibility – solved! (?)
 Reproducibility is more than just sharing the data!
 Provide source code, parameters, data, …
 Ensure that it works:
Wrap it up in a container/virtual machine, …
…
done?
LXC
Outline
 What are the challenges in reproducibility?
 What do we gain by aiming for reproducibility?
 How to address the challenges of complex processes?
 How to deal with dynamic data?
 Summary
Reproducibility – solved! (?)
 Provide source code, parameters, data, …
 Wrap it up in a container/virtual machine, …
…
 Why do we want reproducibility?
 Which levels of reproducibility are there?
 What do we gain by different levels of reproducibility?
LXC
Reproducibility – solved! (?)
 Dagstuhl Seminar:
Reproducibility of Data-Oriented Experiments in e-Science
January 2016, Dagstuhl, Germany
Types of Reproducibility
 The PRIMAD [1] model: which attributes can we “prime”?
- Data
• Parameters
• Input data
- Platform
- Implementation
- Method
- Research Objective
- Actors
 What do we gain by priming one or the other?
[1] Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of Data-Oriented
Experiments in eScience. Dagstuhl Reports, 6(1), 2016.
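One way to make the question "what was primed?" operational is to describe each run by the PRIMAD attributes and compare two descriptions. The sketch below is an assumption about how this could be encoded, not part of the PRIMAD model itself; the class name, field names and all example values are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Experiment:
    """One run, described by the PRIMAD attributes from the slide."""
    parameters: str          # Data: parameters
    input_data: str          # Data: input data
    platform: str
    implementation: str
    method: str
    research_objective: str
    actor: str


def primed(original, rerun):
    """Return the names of the attributes that differ between two runs."""
    return [
        f.name
        for f in fields(Experiment)
        if getattr(original, f.name) != getattr(rerun, f.name)
    ]


# Hypothetical example values.
original = Experiment(
    parameters="window=1024",
    input_data="collection-v1",
    platform="Linux",
    implementation="Java",
    method="feature extraction",
    research_objective="genre classification",
    actor="lab A",
)

# Same study, ported to another platform and re-implemented.
port = Experiment(
    parameters="window=1024",
    input_data="collection-v1",
    platform="Windows",
    implementation="Matlab",
    method="feature extraction",
    research_objective="genre classification",
    actor="lab A",
)

print(primed(original, port))
```

Listing the primed attributes next to a reproducibility result makes explicit what a successful (or failed) reproduction actually tells us.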
Types of Reproducibility and Gains
Reproducibility Papers
 Aim for reproducibility: for one’s own sake – and as chairs of conference tracks, editors, reviewers, supervisors, …
- Review of reproducibility of submitted work (material provided)
- Encouraging reproducibility studies
- (Messages to stakeholders in Dagstuhl Report)
 Consistency of results, not identity!
 Reproducibility studies and papers
- Not just re-running code / a virtual machine
- When is a reproducibility paper worth the effort /
worth being published?
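"Consistency of results, not identity" can be made concrete by comparing reported metrics within a tolerance instead of testing for exact equality. The following is a minimal sketch; the metric names, values and tolerances are hypothetical and would have to be chosen per study.

```python
import math


def consistent(original, reproduced, rel_tol=1e-6, abs_tol=1e-9):
    """True if both runs report the same metrics with values that agree
    within the given tolerances (not necessarily bit for bit)."""
    if original.keys() != reproduced.keys():
        return False
    return all(
        math.isclose(original[k], reproduced[k], rel_tol=rel_tol, abs_tol=abs_tol)
        for k in original
    )


# Hypothetical metric values from an original run and two re-runs.
original = {"accuracy": 0.8731, "f1": 0.8542}
rerun_close = {"accuracy": 0.8731000004, "f1": 0.8541999998}
rerun_off = {"accuracy": 0.80, "f1": 0.8542}

print(original == rerun_close)            # identity check
print(consistent(original, rerun_close))  # consistency check
print(consistent(original, rerun_off))
```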
Reproducibility Papers
 When is a Reproducibility paper worth being published?
Learning from Non-Reproducibility
 Do we always want reproducibility?
- Scientifically speaking: yes!
 Research is addressing challenges:
- Looking for and learning from non-reproducibility!
 Non-reproducibility if
- Some (unknown) aspect of a study influences results
- Technical: parameter sweep, bug in code, OS, … -> fix it!
- Non-technical: input data! (specifically: “the user”)
Learning from Non-Reproducibility
Challenges in MIR (music information retrieval) – “things sometimes don’t seem to work”
VirtualBox, GitHub, <your favourite tool> are starting points
Same features, same algorithm, different data ->
Same data, different listeners ->
Understanding “the rest”:
- Isolating unknown influence factors
- Generating hypotheses
- Verifying these to understand the “entire system”,
cultural and other biases, …
Benchmarks and Meta-Studies
Reproducibility – solved! (?)
 Provide source code, parameters, data, …
 Wrap it up in a container/virtual machine,
 Provide context information
 Encourage reproducibility studies beyond re-running
 Use it to establish trust in your research & gain new insights
done?
LXC
Outline
 What are the challenges in reproducibility?
 What do we gain by aiming for reproducibility?
 How to address the challenges of complex processes?
 How to deal with “Big Data”?
 Summary
Déjà vu…
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234
And the solution is…
 Standardization and Documentation
- Standardized components, procedures, workflows
- Documenting complete system set-up across
entire provenance chain
 How to do this – efficiently?
Alexander Graham Bell’s Notebook, March 9 1876
https://commons.wikimedia.org/wiki/File:Alexander_Graham_Bell's_notebook,_March_9,_1876.PNG
Pieter Bruegel the Elder: De Alchemist (British Museum, London)
Documenting a Process
 Context Model: establish what to document and how
 Meta-model for describing process & context
- Extensible architecture integrated by core model
- Reusing existing models as much as possible
- Based on ArchiMate, implemented using OWL
 Extracted by static and dynamic analysis
Context Model – Static Analysis
 Analyses steps, platforms, services, tools called
 Dependencies (packages, libraries)
 HW, SW Licenses, …
Taverna Workflow
ArchiMate model
Context Model
(OWL ontology)
#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex
Script
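As a rough illustration of static analysis, the script above can be scanned for the tools it calls without executing it. This sketch is an assumption about what such an analysis might look like in its simplest form; it is not the extraction used to build the context model, and it only handles one plain command per line (no pipes, variables or control flow).

```python
import shlex

# The script from the slide, embedded verbatim.
SCRIPT = """#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex
"""


def called_tools(script):
    """Return the distinct executables invoked, in order of first use."""
    tools = []
    for line in script.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and the shebang
        tool = shlex.split(line)[0]
        if tool not in tools:
            tools.append(tool)
    return tools


print(called_tools(SCRIPT))
```

Each tool found this way is a dependency that has to be documented with its version, licence and platform requirements.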
Context Model – Dynamic Analysis
 Process Migration Framework (PMF)
- designed for automatic redeployments into virtual machines
- uses strace to monitor system calls
- complete log of all accessed resources (files, ports)
- captures and stores process instance data
- analyse resources (file formats via PRONOM, PREMIS)
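To give an idea of what monitoring system calls yields, the sketch below extracts the files a process opened from a few lines in the style of strace output. The log lines and paths are made up for the example, and the parsing is a simplified assumption, not the implementation of the Process Migration Framework.

```python
import re

# Illustrative lines in the style of strace output; paths are invented.
SAMPLE_LOG = '''\
execve("/usr/bin/unzip", ["unzip", "-o", "IQData.zip"], 0x7ffd3a2b1c40 /* 20 vars */) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "IQData.zip", O_RDONLY) = 3
openat(AT_FDCWD, "missing.cfg", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "iq.r", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 4
'''

OPEN_CALL = re.compile(
    r'^open(?:at)?\((?:AT_FDCWD, )?"(?P<path>[^"]+)", '
    r'(?P<flags>[A-Z_|]+)[^)]*\) = (?P<ret>-?\d+)'
)


def accessed_files(log):
    """Split successfully opened files into those read and those written."""
    result = {"read": set(), "written": set()}
    for line in log.splitlines():
        match = OPEN_CALL.match(line)
        if not match or int(match["ret"]) < 0:
            continue  # not an open call, or the call failed
        flags = match["flags"].split("|")
        kind = "written" if "O_WRONLY" in flags or "O_RDWR" in flags else "read"
        result[kind].add(match["path"])
    return result


accessed = accessed_files(SAMPLE_LOG)
print(sorted(accessed["read"]))
print(sorted(accessed["written"]))
```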
Context Model – Dynamic Analysis
Taverna Workflow
VFramework
Are these processes the same?
[Figure: original environment -> preserve -> repository -> redeploy -> redeployment environment]
VFramework
VFramework
ADDED
NOT USED
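One plausible reading of the "ADDED" and "NOT USED" labels is a comparison of the resources accessed in the original run with those accessed after redeployment. The sketch below assumes exactly that reading; the function name and the resource lists are hypothetical and do not reproduce the VFramework itself.

```python
def compare_resources(original, redeployed):
    """Compare the resources touched by two executions of one process."""
    original, redeployed = set(original), set(redeployed)
    return {
        "added": sorted(redeployed - original),     # only in the re-execution
        "not_used": sorted(original - redeployed),  # only in the original run
        "common": sorted(original & redeployed),
    }


# Hypothetical resource lists captured for the two runs.
original_run = ["IQData.zip", "iq.r", "iq_utf8.r", "iq.tex"]
redeployed_run = ["IQData.zip", "iq.r", "iq.tex", "iq.aux"]

report = compare_resources(original_run, redeployed_run)
print(report)
```

An empty "added" and "not_used" list is a necessary but not sufficient sign that both executions are the same process; the outputs still have to be compared.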
Reproducibility – solved! (?)
 Provide source code, parameters, data, …
 Wrap it up in a container/virtual machine,
 Provide context information
 Encourage reproducibility studies beyond re-running
 Use it to establish trust in your research & gain new insights
 (automatically) capture process execution context
 Verify re-executions
done?
LXC
Outline
 What are the challenges in reproducibility?
 What do we gain by aiming for reproducibility?
 How to address the challenges of complex processes?
 How to deal with “Big Data”?
 Summary
 Research Data Alliance
 WG on Data Citation:
Making Dynamic Data Citeable
 WG endorsed in March 2014
- Concentrating on the problems of
large, dynamic (changing) datasets
- Focus! Identification of data!
Not: PID systems, metadata, citation string, attribution, …
- Liaise with other WGs and initiatives on data citation
(CODATA, DataCite, Force11, …)
- https://rd-alliance.org/working-groups/data-citation-wg.html
RDA WG Data Citation
Data Citation – Output
 14 Recommendations
grouped into 4 phases:
- Preparing data and query store
- Persistently identifying specific data sets
- Resolving PIDs
- Upon modifications to the data infrastructure
 2-page flyer
https://rd-alliance.org/recommendations-working-group-data-citation-revision-oct-20-2015.html
 More detailed report: IEEE TCDL 2016
http://www.ieee-tcdl.org/Bulletin/v12n1/papers/IEEE-TCDL-DC-2016_paper_1.pdf
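The first three phases can be sketched with a toy in-memory example: data is kept versioned with timestamps, each cited query is stored together with its execution time and a hash of its result, and resolving the identifier re-executes the query against the data as it was at that time. This is a simplified assumption for illustration only; class names, the logical clock and the identifier format are hypothetical, and a real deployment would use a database and a proper PID system.

```python
import hashlib
import json


class VersionedTable:
    """Append-only table: rows are never overwritten, only closed."""

    def __init__(self):
        self.rows = []   # dicts with key, value, valid_from, valid_to
        self.clock = 0   # logical timestamp, incremented on every change

    def upsert(self, key, value):
        self.clock += 1
        for row in self.rows:
            if row["key"] == key and row["valid_to"] is None:
                row["valid_to"] = self.clock  # close the old version
        self.rows.append(
            {"key": key, "value": value, "valid_from": self.clock, "valid_to": None}
        )

    def select(self, min_value, as_of):
        """Rows with value >= min_value as they were at time as_of."""
        hits = [
            (r["key"], r["value"])
            for r in self.rows
            if r["valid_from"] <= as_of
            and (r["valid_to"] is None or r["valid_to"] > as_of)
            and r["value"] >= min_value
        ]
        return sorted(hits)  # fixed order so the hash is stable


def digest(result):
    return hashlib.sha256(json.dumps(result).encode()).hexdigest()


class QueryStore:
    def __init__(self, table):
        self.table = table
        self.entries = {}

    def cite(self, min_value):
        """Store query, timestamp and result hash; return an identifier."""
        as_of = self.table.clock
        result = self.table.select(min_value, as_of)
        pid = f"example-pid-{len(self.entries) + 1}"  # placeholder, not a real PID
        self.entries[pid] = {"min_value": min_value, "as_of": as_of, "hash": digest(result)}
        return pid

    def resolve(self, pid):
        """Re-execute the stored query on the data as of citation time."""
        entry = self.entries[pid]
        result = self.table.select(entry["min_value"], entry["as_of"])
        if digest(result) != entry["hash"]:
            raise ValueError("cited subset can no longer be reproduced")
        return result


table = VersionedTable()
table.upsert("a", 5)
table.upsert("b", 12)
store = QueryStore(table)
pid = store.cite(min_value=10)

table.upsert("b", 3)    # the data keeps changing after the citation
table.upsert("c", 20)

cited = store.resolve(pid)
current = table.select(10, table.clock)
print(cited, current)
```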
Detailed presentation on
Tuesday, Session 9,
12:00-13:30
3 Take-Away Messages
Message 1
Aim at achieving reproducibility at different levels
- Re-run, ask others to re-run
- Re-implement
- Port to different platforms
- Test on different data,
vary parameters (and report!)
If something is not reproducible -> investigate!
(you might be onto something!)
Encourage reproducibility studies!
3 Take-Away Messages
Message 2
Aim for better procedures and documentation
Document the research process, environment,
interim results, …
(preferably automatically, 80:20, …)
The process is part of the data (and vice versa)
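Documenting the environment "preferably automatically" can start very small. The sketch below records the interpreter, operating system and installed packages of a Python-based analysis as JSON next to its results. The choice of fields is an assumption made for this example and far less complete than a full context model.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_environment():
    """Collect a minimal, machine-readable description of the run context."""
    packages = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.metadata.get("Version")
        except Exception:
            continue  # skip distributions with unreadable metadata
        if name and version:
            packages.add(f"{name}=={version}")
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "os": platform.platform(),
        "machine": platform.machine(),
        "argv": list(sys.argv),
        "packages": sorted(packages),
    }


record = capture_environment()
serialized = json.dumps(record, indent=2)
print(serialized[:400])
```

Writing such a record at the start of every run costs almost nothing and covers a useful part of the documentation effort, in the spirit of the 80:20 rule mentioned above.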
Source: xkcd
Pieter Bruegel the Elder: De
Alchemist (British Museum, London)
Research Objects, Context
Models, VFramework
3 Take-Away Messages
Message 3
Aim for proper (research) data management
(not just in academia!)
Data Management Plans, Research Infrastructure Services
Source: http://www.phdcomics.com/comics.php?f=1323
RDA WGDC: Dynamic Data Citation
Detailed presentation on
Tuesday, Session 9,
12:00-13:30
Summary
 Trustworthy and efficient e-Science
 Need to move beyond preserving code + data
 Need to move beyond the focus on description
 Capture Process and entire execution context
 Precisely identify data used in process
 Verification of re-execution
 Data and process re-use as basis for data driven science
- evidence
- investment
- efficiency
Trust!!
Summary
 Preaching and eating…
 Do we do all this in our lab for our experiments?
 No! (not yet?)
 Researchers (also in CS) need assistance
 Institutions and Research Infrastructures
 … and some research on open questions on how to best do
all of this (but mind the infamous 80:20 rule)
Summary
C. Glenn Begley, Alastair M. Buchan, Ulrich Dirnagl: Robust research: Institutions must do their part for reproducibility, Nature 525(7567), Sep 3 2015, Illustration by David Parkins
http://www.nature.com/news/robust-research-institutions-must-do-their-part-for-reproducibility-1.18259?WT.mc_id=SFB_NNEWS_1508_RHBox
Acknowledgements
 Johannes Binder
 Rudolf Mayer
 Tomasz Miksa
 Stefan Pröll
 Stephan Strodl
 Marco Unterberger
 TIMBUS
 SBA: Secure Business Austria
 RDA: Research Data Alliance WGDC
References
 Juliana Freire, Norbert Fuhr and Andreas Rauber. Reproducibility of Data-Oriented Experiments in eScience. Dagstuhl Reports, 6(1), 2016.
 Andreas Rauber, Ari Asmi, Dieter van Uytvanck and Stefan Proell. Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use. Bulletin of IEEE Technical Committee on Digital Libraries (TCDL), vol. 12, 2016.
 Andreas Rauber, Tomasz Miksa, Rudolf Mayer and Stefan Proell. Repeatability and Re-Usability in Scientific Processes: Process Context, Data Identification and Verification. In Proceedings of the 17th International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID), 2015.
 Tomasz Miksa, Rudolf Mayer and Andreas Rauber. Ensuring sustainability of web services dependent processes. International Journal of Computational Science and Engineering (IJCSE), 10(1/2), pp. 70-81, 2015.
 Rudolf Mayer and Andreas Rauber. A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows. 11th IEEE Intl. Conference on e-Science, 2015.
 Rudolf Mayer, Tomasz Miksa and Andreas Rauber. Ontologies for describing the context of scientific experiment processes. 10th IEEE Intl. Conference on e-Science, 2014.
 Tomasz Miksa, Stefan Proell, Rudolf Mayer, Stephan Strodl, Ricardo Vieira, Jose Barateiro and Andreas Rauber. Framework for verification of preserved and redeployed processes. 10th International Conference on Preservation of Digital Objects (IPRES2013), 2013.
 Tomasz Miksa, Stephan Strodl and Andreas Rauber. Process Management Plans. International Journal of Digital Curation, 9(1), pp. 83-97, 2014.
Thank you!
http://www.ifs.tuwien.ac.at/imp

Contenu connexe

En vedette

Invasoras: la cara oculta de gatitos y conejitos
Invasoras: la cara oculta de gatitos y conejitosInvasoras: la cara oculta de gatitos y conejitos
Invasoras: la cara oculta de gatitos y conejitosTxema Campillo
 
Anthony's PowerPoint Resume
Anthony's PowerPoint ResumeAnthony's PowerPoint Resume
Anthony's PowerPoint ResumeCareer Corps
 
Configuraracion basica de windows server 2008
Configuraracion basica de windows server 2008Configuraracion basica de windows server 2008
Configuraracion basica de windows server 2008syed usman ali shah
 
\Biomex Instruments : Company Profile
\Biomex Instruments : Company Profile\Biomex Instruments : Company Profile
\Biomex Instruments : Company ProfileNetscribes, Inc.
 
1890 1920-antiguas etiquetas de bebidas-retratos
1890 1920-antiguas etiquetas de bebidas-retratos1890 1920-antiguas etiquetas de bebidas-retratos
1890 1920-antiguas etiquetas de bebidas-retratosCarlos Cueto
 
Sobreviver nos tropicos e no deserto
Sobreviver nos tropicos e no desertoSobreviver nos tropicos e no deserto
Sobreviver nos tropicos e no desertoDaniel Cepa
 
Mantenimiento preventivo proaces
Mantenimiento preventivo proacesMantenimiento preventivo proaces
Mantenimiento preventivo proacesJesus B. Rodriguez
 
Dr. Jusuf Kardavi - Muslimanja e së nesërmes
Dr. Jusuf Kardavi - Muslimanja e së nesërmesDr. Jusuf Kardavi - Muslimanja e së nesërmes
Dr. Jusuf Kardavi - Muslimanja e së nesërmesShkumbim Jakupi
 
Herramientas para desarrollo de un Cuadro de Mando
Herramientas para desarrollo de un Cuadro de MandoHerramientas para desarrollo de un Cuadro de Mando
Herramientas para desarrollo de un Cuadro de MandoTesys Consultores
 
Competències del Dinamitzador Digital
Competències del Dinamitzador DigitalCompetències del Dinamitzador Digital
Competències del Dinamitzador DigitalCiberteka
 
Guía TeleTriunfador para graduandos del PNFSI/PNFI de Misión Sucre
Guía TeleTriunfador para graduandos del PNFSI/PNFI de Misión SucreGuía TeleTriunfador para graduandos del PNFSI/PNFI de Misión Sucre
Guía TeleTriunfador para graduandos del PNFSI/PNFI de Misión SucreStephenson Prieto
 
1. est. org. y direc. obras
1. est. org. y direc. obras1. est. org. y direc. obras
1. est. org. y direc. obrasIssi Cañas
 
Fundamentos Técnicos del Golf "Fases del Swing" rescatado por Luis Fernando H...
Fundamentos Técnicos del Golf "Fases del Swing" rescatado por Luis Fernando H...Fundamentos Técnicos del Golf "Fases del Swing" rescatado por Luis Fernando H...
Fundamentos Técnicos del Golf "Fases del Swing" rescatado por Luis Fernando H...Luis Fernando Heras Portillo
 

En vedette (19)

Poemes tardor
Poemes tardorPoemes tardor
Poemes tardor
 
Invasoras: la cara oculta de gatitos y conejitos
Invasoras: la cara oculta de gatitos y conejitosInvasoras: la cara oculta de gatitos y conejitos
Invasoras: la cara oculta de gatitos y conejitos
 
Anthony's PowerPoint Resume
Anthony's PowerPoint ResumeAnthony's PowerPoint Resume
Anthony's PowerPoint Resume
 
Merlo
MerloMerlo
Merlo
 
Configuraracion basica de windows server 2008
Configuraracion basica de windows server 2008Configuraracion basica de windows server 2008
Configuraracion basica de windows server 2008
 
\Biomex Instruments : Company Profile
\Biomex Instruments : Company Profile\Biomex Instruments : Company Profile
\Biomex Instruments : Company Profile
 
1890 1920-antiguas etiquetas de bebidas-retratos
1890 1920-antiguas etiquetas de bebidas-retratos1890 1920-antiguas etiquetas de bebidas-retratos
1890 1920-antiguas etiquetas de bebidas-retratos
 
Sobreviver nos tropicos e no deserto
Sobreviver nos tropicos e no desertoSobreviver nos tropicos e no deserto
Sobreviver nos tropicos e no deserto
 
Mantenimiento preventivo proaces
Mantenimiento preventivo proacesMantenimiento preventivo proaces
Mantenimiento preventivo proaces
 
Enero 2016
Enero 2016Enero 2016
Enero 2016
 
Dr. Jusuf Kardavi - Muslimanja e së nesërmes
Dr. Jusuf Kardavi - Muslimanja e së nesërmesDr. Jusuf Kardavi - Muslimanja e së nesërmes
Dr. Jusuf Kardavi - Muslimanja e së nesërmes
 
Herramientas para desarrollo de un Cuadro de Mando
Herramientas para desarrollo de un Cuadro de MandoHerramientas para desarrollo de un Cuadro de Mando
Herramientas para desarrollo de un Cuadro de Mando
 
Competències del Dinamitzador Digital
Competències del Dinamitzador DigitalCompetències del Dinamitzador Digital
Competències del Dinamitzador Digital
 
The 2013 Google Travel Study
The 2013 Google Travel StudyThe 2013 Google Travel Study
The 2013 Google Travel Study
 
Guía TeleTriunfador para graduandos del PNFSI/PNFI de Misión Sucre
Guía TeleTriunfador para graduandos del PNFSI/PNFI de Misión SucreGuía TeleTriunfador para graduandos del PNFSI/PNFI de Misión Sucre
Guía TeleTriunfador para graduandos del PNFSI/PNFI de Misión Sucre
 
Bionic Bio-Elite Fertilizer
Bionic Bio-Elite FertilizerBionic Bio-Elite Fertilizer
Bionic Bio-Elite Fertilizer
 
Sketchup
SketchupSketchup
Sketchup
 
1. est. org. y direc. obras
1. est. org. y direc. obras1. est. org. y direc. obras
1. est. org. y direc. obras
 
Fundamentos Técnicos del Golf "Fases del Swing" rescatado por Luis Fernando H...
Fundamentos Técnicos del Golf "Fases del Swing" rescatado por Luis Fernando H...Fundamentos Técnicos del Golf "Fases del Swing" rescatado por Luis Fernando H...
Fundamentos Técnicos del Golf "Fases del Swing" rescatado por Luis Fernando H...
 

Similaire à Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceDavid De Roure
 
Practical Chaos Engineering
Practical Chaos EngineeringPractical Chaos Engineering
Practical Chaos EngineeringSIGHUP
 
Synergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringSynergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringTao Xie
 
Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilChristian Frech
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
Using Docker Containers to Improve Reproducibility in Software and Web Engine...
Using Docker Containers to Improve Reproducibility in Software and Web Engine...Using Docker Containers to Improve Reproducibility in Software and Web Engine...
Using Docker Containers to Improve Reproducibility in Software and Web Engine...Vincenzo Ferme
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachMihai Criveti
 
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxPyData
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
Puppet for Sys Admins
Puppet for Sys AdminsPuppet for Sys Admins
Puppet for Sys AdminsPuppet
 
Efficient Data Labelling for Ocular Imaging
Efficient Data Labelling for Ocular ImagingEfficient Data Labelling for Ocular Imaging
Efficient Data Labelling for Ocular ImagingPetteriTeikariPhD
 
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...Keiichiro Ono
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Felix Z. Hoffmann
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpcDr Reeja S R
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research ObjectsDavid De Roure
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 

Similaire à Reproducibility challenges in computational settings: what are they, why should we address them, and how? (20)

Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems Science
 
Practical Chaos Engineering
Practical Chaos EngineeringPractical Chaos Engineering
Practical Chaos Engineering
 
Synergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringSynergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software Engineering
 
Reproducible Science and Deep Software Variability
Reproducible Science and Deep Software VariabilityReproducible Science and Deep Software Variability
Reproducible Science and Deep Software Variability
 
Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and Anduril
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Using Docker Containers to Improve Reproducibility in Software and Web Engine...
Using Docker Containers to Improve Reproducibility in Software and Web Engine...Using Docker Containers to Improve Reproducibility in Software and Web Engine...
Using Docker Containers to Improve Reproducibility in Software and Web Engine...
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Puppet for Sys Admins
Puppet for Sys AdminsPuppet for Sys Admins
Puppet for Sys Admins
 
Executable papers
Executable papersExecutable papers
Executable papers
 
Efficient Data Labelling for Ocular Imaging
Efficient Data Labelling for Ocular ImagingEfficient Data Labelling for Ocular Imaging
Efficient Data Labelling for Ocular Imaging
 
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Dat...
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
 
R tutorial
R tutorialR tutorial
R tutorial
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 

Plus de Research Data Alliance

The Value of the Research Data Alliance to Individuals
The Value of the Research Data Alliance to IndividualsThe Value of the Research Data Alliance to Individuals
The Value of the Research Data Alliance to IndividualsResearch Data Alliance
 
The Value of the Research Data Alliance to Individuals
The Value of the Research Data Alliance to IndividualsThe Value of the Research Data Alliance to Individuals
The Value of the Research Data Alliance to IndividualsResearch Data Alliance
 
RDA Value for Infrastructure Providers
RDA Value for Infrastructure ProvidersRDA Value for Infrastructure Providers
RDA Value for Infrastructure ProvidersResearch Data Alliance
 
The Value of the Rda Value for Organisations Performing Research
The Value of the Rda Value for Organisations Performing ResearchThe Value of the Rda Value for Organisations Performing Research
The Value of the Rda Value for Organisations Performing ResearchResearch Data Alliance
 

Plus de Research Data Alliance (20)

RDA in a Nutshell - September 2020
RDA in a Nutshell - September 2020RDA in a Nutshell - September 2020
RDA in a Nutshell - September 2020
 
RDA in a Nutshell - August 2020
RDA in a Nutshell - August 2020RDA in a Nutshell - August 2020
RDA in a Nutshell - August 2020
 
RDA in a Nutshell - July 2020
RDA in a Nutshell - July 2020RDA in a Nutshell - July 2020
RDA in a Nutshell - July 2020
 
RDA in a Nutshell - June 2020
RDA in a Nutshell - June 2020RDA in a Nutshell - June 2020
RDA in a Nutshell - June 2020
 
RDA in a Nutshell - May 2020
RDA in a Nutshell - May 2020RDA in a Nutshell - May 2020
RDA in a Nutshell - May 2020
 
RDA in a Nutshell - April 2020
RDA in a Nutshell - April 2020RDA in a Nutshell - April 2020
RDA in a Nutshell - April 2020
 
RDA in a Nutshell - March 2020
RDA in a Nutshell - March 2020RDA in a Nutshell - March 2020
RDA in a Nutshell - March 2020
 
RDA in a Nutshell - February 2020
RDA in a Nutshell - February 2020RDA in a Nutshell - February 2020
RDA in a Nutshell - February 2020
 
RDA in a Nutshell - January 2020
RDA in a Nutshell - January 2020RDA in a Nutshell - January 2020
RDA in a Nutshell - January 2020
 
Rda in a Nutshell - December 2019
Rda in a Nutshell - December 2019Rda in a Nutshell - December 2019
Rda in a Nutshell - December 2019
 
Rda in a Nutshell - November 2019
Rda in a Nutshell - November 2019Rda in a Nutshell - November 2019
Rda in a Nutshell - November 2019
 
RDA in a Nutshell - October 2019
RDA in a Nutshell - October 2019RDA in a Nutshell - October 2019
RDA in a Nutshell - October 2019
 
The Value of the Research Data Alliance to Individuals
The Value of the Research Data Alliance to IndividualsThe Value of the Research Data Alliance to Individuals
The Value of the Research Data Alliance to Individuals
 
The Value of the Research Data Alliance to Individuals
The Value of the Research Data Alliance to IndividualsThe Value of the Research Data Alliance to Individuals
The Value of the Research Data Alliance to Individuals
 
RDA Value for Infrastructure Providers
RDA Value for Infrastructure ProvidersRDA Value for Infrastructure Providers
RDA Value for Infrastructure Providers
 
Rda in a nutshell september 2019
Rda in a nutshell september 2019Rda in a nutshell september 2019
Rda in a nutshell september 2019
 
The Value of the Rda Value for Organisations Performing Research
The Value of the Rda Value for Organisations Performing ResearchThe Value of the Rda Value for Organisations Performing Research
The Value of the Rda Value for Organisations Performing Research
 
RDA Value for Libraries
RDA Value for LibrariesRDA Value for Libraries
RDA Value for Libraries
 
The Value of the RDA for Funders
The Value of the RDA for FundersThe Value of the RDA for Funders
The Value of the RDA for Funders
 
Rda in a nutshell august 2019
Rda in a nutshell august 2019Rda in a nutshell august 2019
Rda in a nutshell august 2019
 


Reproducibility challenges in computational settings: what are they, why should we address them, and how?

  • 1. Reproducibility Challenges in Computational Settings: What are they, why should we address them, and how? Andreas Rauber Vienna University of Technology rauber@ifs.tuwien.ac.at http://www.ifs.tuwien.ac.at/~andi
  • 2. Outline  What are the challenges in reproducibility?  What do we gain from reproducibility? (and: why is non-reproducibility interesting?)  How to address the challenges of complex processes?  How to deal with “Big Data”?  Summary
  • 3. Challenges in Reproducibility  Challenges in reproducibility  or: Why data sharing is not enough  FAIR principles are a necessity  Data Management and DMPs are a necessity  But they are not sufficient if we want to - Ensure reproducibility - Enable metastudies - Benefit from efficient eScience  …unless we define data broader than we commonly tend to do
  • 5. Challenges in Reproducibility  Excursion: Scientific Processes
  • 6. Challenges in Reproducibility  Excursion: scientific processes set1_freq440Hz_Am12.0Hz set1_freq440Hz_Am05.5Hz set1_freq440Hz_Am11.0Hz Java Matlab
  • 7. Challenges in Reproducibility  Excursion: Scientific Processes  Bug?  Psychoacoustic transformation tables?  Forgetting a transformation?  Different implementation of filters?  Limited accuracy of calculation?  Difference in FFT implementation?  ...?
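One of the candidate causes listed above, limited accuracy of calculation, can be illustrated with a minimal Python sketch. It is not taken from the Java or Matlab implementations mentioned in the slides; the values and the helper name `consistent` are assumptions chosen for illustration. It shows that merely regrouping a floating-point sum changes the last bits of the result, which is one way two correct implementations can disagree.

```python
import math

# Floating-point addition is not associative: the grouping of the
# operations changes the rounding, and therefore the last bits of the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)


def consistent(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Return True if two results agree within a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)


print(left_to_right == right_to_left)            # bitwise identity: False
print(consistent(left_to_right, right_to_left))  # agreement within tolerance: True
```

Comparing within a tolerance rather than bit for bit is a design choice; the appropriate tolerance depends on the experiment.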
  • 10. Challenges in Reproducibility  Large scale quantitative analysis  Obtain workflows from myExperiment.org - March 2015: almost 2,700 WFs (approx. 300-400/year) - Focus on Taverna 2 WFs: 1,443 WFs - Published by authors, so should be “better quality”  Try to re-execute the workflows - Record data on the reasons for failure along the way  Analyse the most common reasons for failures
  • 11. Re-Execution results The majority of workflows fail; only 23.6% are successfully executed - No analysis yet on correctness of results… Challenges in Reproducibility Rudolf Mayer, Andreas Rauber, “A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows”, 11th IEEE Intl. Conference on e-Science, 2015.
  • 12. Computer Science  613 papers in 8 ACM conferences  Process - download paper and classify - search for a link to code (paper, web, email twice) - download code - build and execute Christian Collberg and Todd Proebsting. “Repeatability in Computer Systems Research,” CACM 59(3):62-69, 2016.
  • 13. In a nutshell – and another aspect of reproducibility: Challenges in Reproducibility Source: xkcd
  • 14. Reproducibility – solved! (?)  Reproducibility is more than just sharing the data!  Provide source code, parameters, data, …  Ensure that it works: Wrap it up in a container/virtual machine, … … done? LXC
  • 15. Outline  What are the challenges in reproducibility?  What do we gain by aiming for reproducibility?  How to address the challenges of complex processes?  How to deal with dynamic data?  Summary
  • 16. Reproducibility – solved! (?)  Provide source code, parameters, data, …  Wrap it up in a container/virtual machine, … …  Why do we want reproducibility?  Which levels of reproducibility are there?  What do we gain by different levels of reproducibility? LXC
  • 17. Reproducibility – solved! (?)  Dagstuhl Seminar: Reproducibility of Data-Oriented Experiments in e-Science January 2016, Dagstuhl, Germany
  • 18. Types of Reproducibility  The PRIMAD [1] model: which attributes can we “prime”? - Data • Parameters • Input data - Platform - Implementation - Method - Research Objective - Actors  What do we gain by priming one or the other? [1] Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of Data-Oriented Experiments in eScience. Dagstuhl Reports, 6(1), 2016.
  • 20. Reproducibility Papers  Aim for reproducibility: for one’s own sake – and as chair of a conference track, editor, reviewer, supervisor, … - Review of reproducibility of submitted work (material provided) - Encouraging reproducibility studies - (Messages to stakeholders in Dagstuhl Report)  Consistency of results, not identity!  Reproducibility studies and papers - Not just re-running code / a virtual machine - When is a reproducibility paper worth the effort / worth being published?
  • 21. Reproducibility Papers  When is a Reproducibility paper worth being published?
  • 22. Learning from Non-Reproducibility  Do we always want reproducibility? - Scientifically speaking: yes!  Research is addressing challenges: - Looking for and learning from non-reproducibility!  Non-reproducibility occurs if - Some (unknown) aspect of a study influences results - Technical: parameter sweep, bug in code, OS, … -> fix it! - Non-technical: input data! (specifically: “the user”)
  • 23. Learning from Non-Reproducibility Challenges in MIR – “things sometimes don’t seem to work” Virtual Box, Github, <your favourite tool> are starting points Same features, same algorithm, different data -> Same data, different listeners -> Understanding “the rest”: - Isolating unknown influence factors - Generating hypotheses - Verifying these to understand the “entire system”, cultural and other biases, … Benchmarks and Meta-Studies
  • 24. Reproducibility – solved! (?)  Provide source code, parameters, data, …  Wrap it up in a container/virtual machine,  Provide context information  Encourage reproducibility studies beyond re-running  Use it to establish trust in your research & gain new insights done? LXC
  • 25. Outline  What are the challenges in reproducibility?  What do we gain by aiming for reproducibility?  How to address the challenges of complex processes?  How to deal with “Big Data”?  Summary
  • 27. And the solution is…  Standardization and Documentation - Standardized components, procedures, workflows - Documenting complete system set-up across entire provenance chain  How to do this – efficiently? Alexander Graham Bell’s Notebook, March 9 1876 https://commons.wikimedia.org/wiki/File:Alexander_Graham_Bell's_notebook,_March_9,_1876.PNG Pieter Bruegel the Elder: De Alchemist (British Museum, London)
  • 28. Documenting a Process  Context Model: establish what to document and how  Meta-model for describing process & context - Extensible architecture integrated by core model - Reusing existing models as much as possible - Based on ArchiMate, implemented using OWL  Extracted by static and dynamic analysis
  • 29. Context Model – Static Analysis  Analyses steps, platforms, services, tools called  Dependencies (packages, libraries)  HW, SW Licenses, … [Figure: Taverna Workflow, ArchiMate model, Context Model (OWL ontology)] Script:
      #!/bin/bash
      # fetch data
      java -jar GestBarragensWSClientIQData.jar
      unzip -o IQData.zip
      # fix encoding
      #iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
      # generate references
      R --vanilla < iq_utf8.r > IQout.txt
      # create pdf
      pdflatex iq.tex
      pdflatex iq.tex
  • 30. Context Model – Dynamic Analysis  Process Migration Framework (PMF) - designed for automatic redeployments into virtual machines - uses strace to monitor system calls - complete log of all accessed resources (files, ports) - captures and stores process instance data - analyse resources (file formats via PRONOM, PREMIS)
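The idea of deriving a log of accessed resources from monitored system calls can be sketched in Python as follows. This is not the PMF implementation; it is a minimal sketch that assumes the textual output format of `strace` for `open`/`openat` calls, and the function name `accessed_files` and the sample log lines are hypothetical.

```python
import re

# Matches open()/openat() calls in strace text output, for example:
#   openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
# An optional leading PID (as printed by `strace -f`) is tolerated.
OPEN_CALL = re.compile(
    r'^(?:\d+\s+)?(?:open|openat)\((?:[^,"]+,\s*)?"(?P<path>[^"]*)".*\)'
    r'\s*=\s*(?P<ret>-?\d+)'
)


def accessed_files(strace_lines):
    """Return the set of paths that were opened successfully."""
    paths = set()
    for line in strace_lines:
        match = OPEN_CALL.match(line)
        # A negative return value means the call failed (e.g. ENOENT).
        if match and int(match.group("ret")) >= 0:
            paths.add(match.group("path"))
    return paths


# Hypothetical log excerpt, made up for illustration.
sample_log = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "IQData.zip", O_RDONLY) = 4',
    'open("/missing.conf", O_RDONLY) = -1 ENOENT (No such file or directory)',
    'read(3, "...", 832) = 832',
]
print(sorted(accessed_files(sample_log)))
```

A real capture would also need to cover network ports and other system calls; this sketch handles file opens only.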
  • 31. Context Model – Dynamic Analysis Taverna Workflow
  • 32. VFramework Are these processes the same? [Figure: Original environment, Repository, Redeployment environment; steps: Preserve, Redeploy]
  • 34. VFramework #!/bin/bash # fetch data java -jar GestBarragensWSClientIQData.jar unzip -o IQData.zip # fix encoding #iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r # generate references R --vanilla < iq_utf8.r > IQout.txt # create pdf pdflatex iq.tex pdflatex iq.tex
  • 37. #!/bin/bash # fetch data java -jar GestBarragensWSClientIQData.jar unzip -o IQData.zip # fix encoding #iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r # generate references R --vanilla < iq_utf8.r > IQout.txt # create pdf pdflatex iq.tex pdflatex iq.tex VFramework ADDED NOT USED
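A comparison of this kind can be sketched in Python as a set difference over the resources recorded in the two environments. This is not the VFramework implementation: reading "ADDED" as resources seen only in the redeployment and "NOT USED" as resources seen only in the original run is an assumption, and the function name `compare_resources` and the file lists are hypothetical.

```python
def compare_resources(original, redeployed):
    """Compare the resources accessed in two executions of a process."""
    original, redeployed = set(original), set(redeployed)
    return {
        # Accessed only in the redeployed execution.
        "added": sorted(redeployed - original),
        # Accessed only in the original execution.
        "not_used": sorted(original - redeployed),
        # Accessed in both executions.
        "shared": sorted(original & redeployed),
    }


# Hypothetical resource lists, made up for illustration.
original_run = ["IQData.zip", "iq.r", "iq.tex"]
redeployed_run = ["IQData.zip", "iq_utf8.r", "iq.tex"]

report = compare_resources(original_run, redeployed_run)
print(report)
```

An empty "added" and "not_used" list would indicate that both executions touched the same resources; it would not by itself show that the outputs are equivalent.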
  • 38. Reproducibility – solved! (?)  Provide source code, parameters, data, …  Wrap it up in a container/virtual machine,  Provide context information  Encourage reproducibility studies beyond re-running  Use it to establish trust in your research & gain new insights  (automatically) capture process execution context  Verify re-executions done? LXC
  • 39. Outline  What are the challenges in reproducibility?  What do we gain by aiming for reproducibility?  How to address the challenges of complex processes?  How to deal with “Big Data”?  Summary
  • 40.  Research Data Alliance  WG on Data Citation: Making Dynamic Data Citeable  WG endorsed in March 2014 - Concentrating on the problems of large, dynamic (changing) datasets - Focus! Identification of data! Not: PID systems, metadata, citation string, attribution, … - Liaise with other WGs and initiatives on data citation (CODATA, DataCite, Force11, …) - https://rd-alliance.org/working-groups/data-citation-wg.html RDA WG Data Citation
  • 42. Data Citation – Output  14 Recommendations grouped into 4 phases: - Preparing data and query store - Persistently identifying specific data sets - Resolving PIDs - Upon modifications to the data infrastructure  2-page flyer https://rd-alliance.org/recommendations-working-group-data-citation-revision-oct-20-2015.html  More detailed report: IEEE TCDL 2016 http://www.ieee-tcdl.org/Bulletin/v12n1/papers/IEEE-TCDL-DC-2016_paper_1.pdf Detailed presentation on Tuesday, Session 9, 12:00-13:30
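The first three phases above can be sketched in Python with the standard `sqlite3` module. This is a minimal sketch under stated assumptions, not the working group's reference implementation: the table layout, the integer timestamps, the `pid-1` identifier and the function names `cite` and `resolve` are all hypothetical. It assumes that data is versioned with validity timestamps and that a cited subset is identified by storing the query, its execution timestamp and a hash of the result set.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Versioned data: rows are never overwritten, only superseded.
conn.execute(
    """
    CREATE TABLE measurements (
        id INTEGER,
        value REAL,
        valid_from INTEGER NOT NULL,  -- when this row version was added
        valid_to INTEGER              -- when it was superseded (NULL = current)
    )
    """
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        (1, 10.0, 100, None),
        (2, 20.0, 100, 200),   # superseded at t=200
        (2, 25.0, 200, None),  # corrected value
    ],
)

# Stable sorting (ORDER BY) keeps the result hash deterministic.
QUERY = (
    "SELECT id, value FROM measurements "
    "WHERE valid_from <= :ts AND (valid_to IS NULL OR valid_to > :ts) "
    "ORDER BY id"
)

query_store = {}  # maps a PID to the stored query metadata


def run_as_of(ts):
    """Execute the query against the state of the data at timestamp ts."""
    return conn.execute(QUERY, {"ts": ts}).fetchall()


def result_hash(result):
    """Hash a result set so a later re-execution can be verified."""
    return hashlib.sha256(repr(result).encode("utf-8")).hexdigest()


def cite(pid, ts):
    """Store query, timestamp and result hash under a persistent identifier."""
    result = run_as_of(ts)
    query_store[pid] = {"query": QUERY, "timestamp": ts, "hash": result_hash(result)}
    return result


def resolve(pid):
    """Re-execute the stored query and verify the result set is unchanged."""
    entry = query_store[pid]
    result = run_as_of(entry["timestamp"])
    if result_hash(result) != entry["hash"]:
        raise ValueError("result set differs from the cited one")
    return result


cited = cite("pid-1", 150)
print(cited, resolve("pid-1") == cited)
```

Because the correction at t=200 adds a new row version instead of overwriting the old one, resolving `pid-1` later still returns the data as it was cited.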
  • 43. 3 Take-Away Messages Message 1 Aim at achieving reproducibility at different levels - Re-run, ask others to re-run - Re-implement - Port to different platforms - Test on different data, vary parameters (and report!) If something is not reproducible -> investigate! (you might be onto something!) Encourage reproducibility studies!
  • 44. 3 Take-Away Messages Message 2 Aim for better procedures and documentation Document the research process, environment, interim results, … (preferably automatically, 80:20, …) The process is part of the data (and vice versa) Source: xkcd Pieter Bruegel the Elder: De Alchemist (British Museum, London) Research Objects, Context Models, VFramework
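Automatic documentation of the environment can start very small. The following Python sketch is only an assumption about what such a record might contain, not one of the tools named above; the function name `capture_environment` and the chosen fields are hypothetical. It records the interpreter, the platform and the installed packages using only the standard library.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Collect a small, serialisable description of the execution environment."""
    packages = sorted(
        {
            f"{dist.metadata['Name']}=={dist.version}"
            for dist in metadata.distributions()
            if dist.metadata["Name"]  # skip distributions with broken metadata
        }
    )
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "argv": list(sys.argv),
        "packages": packages,
    }


snapshot = capture_environment()
# Serialise the record so it can be stored next to the results of a run.
record = json.dumps(snapshot, indent=2)
print(record[:200])
```

Writing `record` to a file alongside each set of results is one way to keep the process description attached to the data it produced.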
  • 45. 3 Take-Away Messages Message 3 Aim for proper (research) data management (not just in academia!) Data Management Plans, Research Infrastructure Services Source: http://www.phdcomics.com/comics.php?f=1323 RDA WGDC: Dynamic Data Citation Detailed presentation on Tuesday, Session 9, 12:00-13:30
  • 46. Summary  Trustworthy and efficient e-Science  Need to move beyond preserving code + data  Need to move beyond the focus on description  Capture Process and entire execution context  Precisely identify data used in process  Verification of re-execution  Data and process re-use as basis for data driven science - evidence - investment - efficiency Trust!!
  • 47. Summary  Preaching and eating…  Do we do all this in our lab for our experiments?  No! (not yet?)  Researchers (also in CS) need assistance  Institutions and Research Infrastructures  … and some research on open questions on how to best do all of this (but mind the infamous 80:20 rule)
  • 48. Summary C. Glenn Begley, Alastair M. Buchan, Ulrich Dirnagl: Robust research: Institutions must do their part for reproducibility, Nature 525(7567), Sep 3 2015, Illustration by David Parkins http://www.nature.com/news/robust-research-institutions-must-do-their-part-for-reproducibility-1.18259?WT.mc_id=SFB_NNEWS_1508_RHBox
  • 49. Acknowledgements  Johannes Binder  Rudolf Mayer  Tomasz Miksa  Stefan Pröll  Stephan Strodl  Marco Unterberger  TIMBUS  SBA: Secure Business Austria  RDA: Research Data Alliance WGDC
  • 50. References
      - Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of Data-Oriented Experiments in eScience. Dagstuhl Reports, 6(1), 2016.
      - Andreas Rauber, Ari Asmi, Dieter van Uytvanck and Stefan Proell. Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use. Bulletin of IEEE Technical Committee on Digital Libraries (TCDL), vol. 12, 2016.
      - Andreas Rauber, Tomasz Miksa, Rudolf Mayer and Stefan Proell. Repeatability and Re-Usability in Scientific Processes: Process Context, Data Identification and Verification. In Proceedings of the 17th International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID), 2015.
      - Tomasz Miksa, Rudolf Mayer and Andreas Rauber. Ensuring sustainability of web services dependent processes. International Journal of Computational Science and Engineering (IJCSE), Vol. 10, No. 1/2, pp. 70-81, 2015.
      - Rudolf Mayer and Andreas Rauber. A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows. 11th IEEE Intl. Conference on e-Science, 2015.
      - Rudolf Mayer, Tomasz Miksa and Andreas Rauber. Ontologies for describing the context of scientific experiment processes. 10th IEEE Intl. Conference on e-Science, 2014.
      - Tomasz Miksa, Stefan Proell, Rudolf Mayer, Stephan Strodl, Ricardo Vieira, Jose Barateiro and Andreas Rauber. Framework for verification of preserved and redeployed processes. 10th International Conference on Preservation of Digital Objects (IPRES2013), 2013.
      - Tomasz Miksa, Stephan Strodl and Andreas Rauber. Process Management Plans. International Journal of Digital Curation, Vol. 9, No. 1, pp. 83-97, 2014.

Editor's notes

  1. - test instance selection - local dependencies - external dependencies - provenance data - state of the context model [integrated + scheme] (discussion on the minimal context model)