Why Workflows Break - Understanding and Combating
                                      Decay in Taverna Workflows
                 Jun Zhao, Jose Manuel Gomez-Perez, Khalid Belhajjame, Graham Klyne,
                   Esteban Garcia-Cuesta, Aleix Garrido, Kristina Hettne, Marco Roos,
                                  David De Roure, and Carole Goble

                                                                   IEEE eScience 2012. Chicago, USA   10 October, 2012
http://www.flickr.com/photos/sheepies/3798650645/ @ CC BY-NC 2.0
Reproducibility: Why Bother?
◉    Results produced by scientists not only give insight,
     they lead to progress and are built upon

◉    Therefore, the ability to test results is important

◉    In natural sciences, when a scientist claims an
     experimental result, other scientists should be
     able to check it.

◉    This should also be possible for experiments carried
     out in computational environments.
47 of 53
 “landmark”
 publications
 could not be
 replicated

 Inadequate cell lines
 and animal models




Nature, 483, 2012
    Credit to Carole Goble JCDL 2012 Keynote
A famous quote

An article about computational science in a scientific
publication is not the scholarship itself, it is merely
advertising of the scholarship. The actual
scholarship is the complete software
development environment and the complete
set of instructions which generated the figures.

Jon B. Buckheit and David L. Donoho,
WaveLab and reproducible research,
1995
Another quote

Abandoning the habit of secrecy in favor of
process transparency and peer review was the
crucial step by which alchemy became
chemistry.

Eric S. Raymond, The art of UNIX
programming, 2004

Workflows: A Means for
Preserving Scientific Methods
 Fortunately, there is a means that can be used to document
 the experiment that the scientist ran, and even re-run it!

Scientific workflows
    Increasingly adopted in modern sciences.
    Transparent documentation of experimental methods
    Repeatable and configurable

[Figure: example workflow with inputs chromosome17 and chromosome37, Kegg pathway query steps, a "Detect common pathways" step, and a "Common pathways" output]
Workflow Decay
   A decayed or reduced ability to be executed or
   produce the same results
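
The definition above suggests a simple operational test: re-run the workflow and compare its output with a stored reference. The following is a minimal sketch under stated assumptions, not the procedure used in the study. The names `check_decay` and `digest`, and the choice of comparing SHA-256 digests of byte outputs, are illustrative.

```python
import hashlib
from typing import Callable, Optional


def digest(data: bytes) -> str:
    """Return a SHA-256 hex digest used to compare workflow outputs."""
    return hashlib.sha256(data).hexdigest()


def check_decay(run: Callable[[], bytes], reference_digest: Optional[str]) -> str:
    """Classify one re-execution of a workflow.

    `run` is a hypothetical callable that executes the workflow and returns
    its output as bytes; `reference_digest` is the digest of a previously
    recorded output, or None when no reference result was preserved.
    """
    try:
        output = run()
    except Exception:
        # Any failure to execute counts as a loss of executability.
        return "not executable"
    if reference_digest is None:
        # Without a reference output the result cannot be tested.
        return "executable, result not testable"
    if digest(output) == reference_digest:
        return "same result"
    return "different result"
```

Byte-identical comparison is a deliberately strict assumption; workflows whose outputs legitimately vary between runs (timestamps, ordering) would need a looser comparison.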

Our Contributions
  An empirical analysis for identifying and
categorizing the causes of workflow decay
  A software framework to assess workflow
preservation
Storyline
 The importance of reproducibility

 Workflow as a means for preserving scientific methods

   Understanding the causes of workflow decay

   Combating decay

   Lessons learnt and future work


Understanding The Causes of
     Workflow Decay
 We adopted an empirical approach
    To identify the causes of workflow decay
    To quantify their severity

 To do so, we analyzed a sample of real
 workflows to determine if they suffer from
 decay and the reasons that caused their decay


Experimental Setup
Taverna workflows from myExperiment.org
    Taverna 1
    Taverna 2

Selection process
    By the creation year
    By the creator
    By the domain

Software environment
    Taverna 2.3

Experiment metadata
    June-July 2012
    4 researchers




Analyzed Workflows
Number of Taverna 1 workflows from 2007 to 2011
          2007   2008   2009   2010   2011
Tested      12     10     10     10     4*
Total       74    341    101     26     13

Number of Taverna 2 workflows from 2009 to 2012
          2009   2010   2011   2012
Tested      12     10     15      9
Total       97    308    289    184
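
The sample sizes in the two tables above can be tallied with a short script. The counts are copied from the tables; the variable and function names are illustrative.

```python
# (tested, total) workflow counts per year, copied from the two tables above.
# The 2011 Taverna 1 entry is marked "4*" in the source; the asterisk is not
# explained there, so the value is used as 4.
taverna1 = {2007: (12, 74), 2008: (10, 341), 2009: (10, 101), 2010: (10, 26), 2011: (4, 13)}
taverna2 = {2009: (12, 97), 2010: (10, 308), 2011: (15, 289), 2012: (9, 184)}


def tally(counts):
    """Sum the tested and total counts over all years."""
    tested = sum(t for t, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return tested, total


t1_tested, t1_total = tally(taverna1)
t2_tested, t2_total = tally(taverna2)

print("Taverna 1:", t1_tested, "tested of", t1_total)  # 46 tested of 555
print("Taverna 2:", t2_tested, "tested of", t2_total)  # 46 tested of 878
print("All tested:", t1_tested + t2_tested)            # 92
```

The combined figure of 92 tested workflows is the sample size that the decay proportions in this talk refer to.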




Profile of Analyzed Workflows




The Proportion of Decay
75% of the 92 tested workflows failed either to execute or
to produce the same result (if testable)

Those from the early years (2007-2009) had a 91% failure rate

[Figure: charts of the proportion of decay for Taverna 1 and Taverna 2 workflows]




The Cause of Decay
Manual analysis
    By the validation report from Taverna workbench
    By interpreting experiment results reported by Taverna

Identified 4 categories of causes
    Missing example data
    Missing execution environment
    Insufficient descriptions about workflows
    Volatile third-party resources

Other possible factors, not considered here
    Changes in the local operating environment (hardware, OS,
    middleware, compiler, etc.)


Decay Caused by Third-Party Resources
Causes                  Refined Causes                               Examples
Third-party resources   Underlying dataset, particularly a           Researcher hosting the data changed
are not available       locally hosted in-house dataset, is no       institution; server is no longer available
                        longer available
                        Services are deprecated                      DDBJ web services are no longer
                                                                     provided despite the fact that they are
                                                                     used in many myExperiment
                                                                     workflows
Third-party resources   Data is available but identified using       For scalability reasons the input
are available but not   different IDs than the ones known to         data is superseded by new data, making
accessible              the user                                     the workflow not executable or
                                                                     making it produce wrong results
                        Data is available but permission, a          Cannot get the input, which is a
                        certificate, or network access is            security token that can only be
                        needed to reach it                           obtained by a registered user of
                                                                     ChemSpider
                        Services are available but need              The security policies of the execution
                        permission, a certificate, or network        framework are updated due to new
                        access to invoke them                        hosting institution rules
Third-party resources   Services are still available under the       The web services are updated
have changed            same identifiers but their functionality
                        has changed
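
The three top-level causes in the table above can be told apart mechanically when probing a web-accessible resource. The following is a hedged sketch, not part of the study's tooling: the mapping from HTTP status codes to causes, and the names `ResourceStatus` and `classify`, are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class ResourceStatus(Enum):
    OK = "ok"
    UNAVAILABLE = "not available"
    INACCESSIBLE = "available but not accessible"
    CHANGED = "changed"


def classify(http_status: Optional[int],
             output_matches_reference: Optional[bool] = None) -> ResourceStatus:
    """Map the outcome of probing a third-party resource to a decay cause.

    `http_status` is None when no response was received at all (for example
    a DNS failure or timeout). `output_matches_reference` is False when the
    resource responded but returned something other than a recorded
    reference output, and None when no reference is available.
    """
    if http_status in (401, 403, 407):
        # The resource exists but permission or credentials are required.
        return ResourceStatus.INACCESSIBLE
    if http_status is None or http_status >= 400:
        # No response, or any other client/server error: treat as gone.
        return ResourceStatus.UNAVAILABLE
    if output_matches_reference is False:
        # Same identifier, different behaviour.
        return ResourceStatus.CHANGED
    return ResourceStatus.OK
```

A probe result of `classify(200, output_matches_reference=False)` would fall under "resources have changed", while `classify(403)` would fall under "available but not accessible".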
Summary of Decay Causes
               50% of the decay was caused by the
               volatility of third-party resources
                    Unavailable
                    Inaccessible
                    Updated

               Missing example data
                    Unable to re-run

               Missing execution environment
                    Such as local plugins

               Insufficient metadata
                    Such as any required
                    dependency libraries or
                    permission information




Combating Workflow Decay
      •    Objective: To provide enough information to
           – Prevent decay
           – Detect decay
           – Repair decay



      •     Approach: Research Objects + Checklists
            – Research Objects [1][2]: Aggregate workflow specifications
              together with auxiliary elements, such as example data inputs,
              annotations, and provenance traces, that can be used to prevent
              decay and/or repair the workflow in case of decay.
            – Checklists: to check that sufficient information is preserved
              along with the workflows
           [1] http://wf4ever.github.com/ro/
           [2] http://wf4ever.github.com/ro-primer/
Checklists
• Checklists are a well-established tool
for guiding practices to ensure safety,
quality and consistency in the conduct
of complex operations.

• They have been adopted by the
biological research community to
promote consistency across research
datasets.

• In our case, we use checklists to
assess whether a research object contains
sufficient information for running the
workflow and checking that its results
are replicable.
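
As an illustration of the idea, a checklist can be expressed as a list of named requirements evaluated against the contents of a research object. This is a minimal sketch and not the Minim model or the Wf4Ever tooling: the dictionary layout of the research object, the resource kinds (`workflow`, `example_inputs`, `example_outputs`, `annotations`), and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: resource kind -> list of aggregated items.
ResearchObject = Dict[str, List[str]]


@dataclass
class Requirement:
    label: str
    check: Callable[[ResearchObject], bool]


def has(kind: str) -> Callable[[ResearchObject], bool]:
    """Build a check that passes when at least one item of `kind` is aggregated."""
    return lambda ro: bool(ro.get(kind))


CHECKLIST = [
    Requirement("workflow definition present", has("workflow")),
    Requirement("example input data present", has("example_inputs")),
    Requirement("example outputs present", has("example_outputs")),
    Requirement("annotations present", has("annotations")),
]


def evaluate(ro: ResearchObject, checklist: List[Requirement]) -> List[str]:
    """Return the labels of all requirements the research object fails."""
    return [req.label for req in checklist if not req.check(ro)]


# A research object that aggregates a workflow and annotations but no data.
ro = {"workflow": ["wf.t2flow"], "example_inputs": [], "annotations": ["description.txt"]}
print(evaluate(ro, CHECKLIST))
# ['example input data present', 'example outputs present']
```

An empty result means every requirement in the checklist is satisfied; a non-empty result lists what would need to be added before the workflow can be re-run and its results checked.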
Checklist-ing the Reproducibility
                of a Workflow




The Minim model used in our approach is an adaptation of the MIM model [1][2].
[1] Matthew Gamble, Jun Zhao, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and
Framework for Scientific Linked Data. eScience 2012
[2] https://raw.github.com/wf4ever/ro-manager/master/src/iaeval/Minim/minim.rdf

Use Case
•   4 myExperiment packs
    –   2 from genomics, 1 from geography, and 1 domain-neutral

•   Experiment process:
    –   Transform them into Research Objects (ROs)
    –   Create checklist descriptions

•   Observations
    –   2 research objects were found not to contain the information
        necessary to run them; the other 2 failed because of updates to
        third-party resources and to the execution environment.



Lessons Learnt


1. Dependency is the root enemy of reproducible
   workflows

2. Documentation, i.e., annotation, is vital

3. Documentation should be easy to create




The Future Work
•   Decay detection, explanation, and repair
•   Reproducibility and provenance
•   Working with scientists is vital for reproducible science
    –   GigaScience
    –   BioVeL
    –   2020 Science




Acknowledgement
EU Wf4Ever project (270129)
funded under EU FP7 (ICT-2009.4.1).
(http://www.wf4ever-project.org)




                                       The principles of provenance. Dagstuhl, March 1, 2012

Contenu connexe

Similaire à Why Workflows Break

Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014Carole Goble
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Carole Goble
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker, Inc.
 
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data PublishingScott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data PublishingGigaScience, BGI Hong Kong
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science Carole Goble
 
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012Idafen Santana Pérez
 
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?GigaScience, BGI Hong Kong
 
Mtsr2015 goble-keynote
Mtsr2015 goble-keynoteMtsr2015 goble-keynote
Mtsr2015 goble-keynoteCarole Goble
 
Scientific data management from the lab to the web
Scientific data management   from the lab to the webScientific data management   from the lab to the web
Scientific data management from the lab to the webJose Manuel Gómez-Pérez
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
2017-11-03 Provenance and Research Object
2017-11-03 Provenance and Research Object2017-11-03 Provenance and Research Object
2017-11-03 Provenance and Research ObjectStian Soiland-Reyes
 
The Research Object Initiative: Frameworks and Use Cases
The Research Object Initiative:Frameworks and Use CasesThe Research Object Initiative:Frameworks and Use Cases
The Research Object Initiative: Frameworks and Use CasesCarole Goble
 
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)Stian Soiland-Reyes
 
Common Motifs in Scientific Workflows: An Empirical Analysis
Common Motifs in Scientific Workflows: An Empirical AnalysisCommon Motifs in Scientific Workflows: An Empirical Analysis
Common Motifs in Scientific Workflows: An Empirical Analysisdgarijo
 
Sigir12 tutorial: Query Perfromance Prediction for IR
Sigir12 tutorial: Query Perfromance Prediction for IRSigir12 tutorial: Query Perfromance Prediction for IR
Sigir12 tutorial: Query Perfromance Prediction for IRDavid Carmel
 
Testing survey by_directions
Testing survey by_directionsTesting survey by_directions
Testing survey by_directionsTao He
 
Collaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsCollaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsAndrea Wiggins
 
myExperiment - Defining the Social Virtual Research Environment
myExperiment - Defining the Social Virtual Research EnvironmentmyExperiment - Defining the Social Virtual Research Environment
myExperiment - Defining the Social Virtual Research EnvironmentDavid De Roure
 

Similaire à Why Workflows Break (20)

Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce Hoff
 
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data PublishingScott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
 
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
Conservation of Scientific Workflow Infrastructures by Using Semantics - 2012
 
Reproducibility 1
Reproducibility 1Reproducibility 1
Reproducibility 1
 
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
 
Mtsr2015 goble-keynote
Mtsr2015 goble-keynoteMtsr2015 goble-keynote
Mtsr2015 goble-keynote
 
Scientific data management from the lab to the web
Scientific data management   from the lab to the webScientific data management   from the lab to the web
Scientific data management from the lab to the web
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Research Objects in Wf4Ever
Research Objects in Wf4EverResearch Objects in Wf4Ever
Research Objects in Wf4Ever
 
2017-11-03 Provenance and Research Object
2017-11-03 Provenance and Research Object2017-11-03 Provenance and Research Object
2017-11-03 Provenance and Research Object
 
The Research Object Initiative: Frameworks and Use Cases
The Research Object Initiative:Frameworks and Use CasesThe Research Object Initiative:Frameworks and Use Cases
The Research Object Initiative: Frameworks and Use Cases
 
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
2013-07-19 myExperiment research objects, beyond workflows and packs (PPTX)
 
Common Motifs in Scientific Workflows: An Empirical Analysis
Common Motifs in Scientific Workflows: An Empirical AnalysisCommon Motifs in Scientific Workflows: An Empirical Analysis
Common Motifs in Scientific Workflows: An Empirical Analysis
 
Sigir12 tutorial: Query Perfromance Prediction for IR
Sigir12 tutorial: Query Perfromance Prediction for IRSigir12 tutorial: Query Perfromance Prediction for IR
Sigir12 tutorial: Query Perfromance Prediction for IR
 
Testing survey by_directions
Testing survey by_directionsTesting survey by_directions
Testing survey by_directions
 
Collaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsCollaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna Workflows
 
myExperiment - Defining the Social Virtual Research Environment
myExperiment - Defining the Social Virtual Research EnvironmentmyExperiment - Defining the Social Virtual Research Environment
myExperiment - Defining the Social Virtual Research Environment
 

Plus de Khalid Belhajjame

Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsLineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsKhalid Belhajjame
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eSciencePrivacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScienceKhalid Belhajjame
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsKhalid Belhajjame
 
A Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its ExtensionsA Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its ExtensionsKhalid Belhajjame
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsLinking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsKhalid Belhajjame
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...Khalid Belhajjame
 
Detecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow ResultsDetecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow ResultsKhalid Belhajjame
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Khalid Belhajjame
 

Plus de Khalid Belhajjame (16)

Provenance witha purpose
Provenance witha purposeProvenance witha purpose
Provenance witha purpose
 
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsLineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eSciencePrivacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScience
 
Irpb workshop
Irpb workshopIrpb workshop
Irpb workshop
 
Aussois bda-mdd-2018
Aussois bda-mdd-2018Aussois bda-mdd-2018
Aussois bda-mdd-2018
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objects
 
A Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its ExtensionsA Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its Extensions
 
Anr cair meeting feb 2016
Anr cair meeting feb 2016Anr cair meeting feb 2016
Anr cair meeting feb 2016
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsLinking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scripts
 
Tapp 2014 (belhajjame)
Tapp 2014 (belhajjame)Tapp 2014 (belhajjame)
Tapp 2014 (belhajjame)
 
Credible workshop
Credible workshopCredible workshop
Credible workshop
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
 
D-prov use-case
D-prov use-caseD-prov use-case
D-prov use-case
 
Detecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow ResultsDetecting Duplicate Records in Scientific Workflow Results
Detecting Duplicate Records in Scientific Workflow Results
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)
 
Edbt 2010, Belhajjame
Edbt 2010, BelhajjameEdbt 2010, Belhajjame
Edbt 2010, Belhajjame
 

Dernier

ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 

Why Workflows Break

  • 1. Why Workflows Break - Understanding and Combating Decay in Taverna Workflows Jun Zhao, Jose Manuel Gomez-Perez, Khalid Belhajjame, Graham Klyne, Esteban Garcia-Cuesta, Aleix Garrido, Kristina Hettne, Marco Roos, David De Roure, and Carole Goble IEEE eScience 2012. Chicago, USA 10 October, 2012 http://www.flickr.com/photos/sheepies/3798650645/ @ CC BY-NC 2.0
  • 2. Reproducibility: Why Bother? ◉ Results produced by scientists not only give insight, they lead to progress and are built upon ◉ Therefore, the ability to test results is important ◉ In natural sciences, when a scientist claims an experimental result, other scientists should be able to check it. ◉ This should also be possible for experiments carried out in computational environments.
  • 3. 47 of 53 “landmark” publications could not be replicated Inadequate cell lines and animal models Nature, 483, 2012 Credit to Carole Goble JCDL 2012 Keynote
  • 6. A famous quote: "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures." Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995
  • 7. Another quote: "Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry." Eric S. Raymond, The Art of UNIX Programming, 2004
  • 8. Workflows: A Means for Preserving Scientific Methods. Fortunately, there is a means that can be used to document the experiment that the scientist ran, and even re-run it: scientific workflows. They are increasingly adopted in modern sciences, provide transparent documentation of experimental methods, and are repeatable and configurable. [Figure: workflow diagram with inputs chromosome17 and chromosome37, Kegg pathway query steps, Detect common pathways steps, and output Common pathways]
  • 9. Workflow Decay: a decayed or reduced ability to be executed or to produce the same results. Our Contributions: (1) an empirical analysis for identifying and categorizing the causes of workflow decay; (2) a software framework to assess workflow preservation.
  • 10. Storyline ✓ The importance of reproducibility ✓ Workflow as a means for preserving scientific methods • Understanding the causes of workflow decay • Combating decay • Lessons learnt and future work
  • 11. Understanding the Causes of Workflow Decay. We adopted an empirical approach to identify the causes of workflow decay and to quantify their severity. To do so, we analyzed a sample of real workflows to determine whether they suffer from decay and the reasons for their decay.
  • 12. Experimental Setup
       Taverna workflows from myExperiment.org: Taverna 1, Taverna 2
       Software environment: Taverna 2.3
       Experiment metadata: June-July 2012, 4 researchers
       Selection process: by the creation year, by the creator, by the domain
  • 13. Analyzed Workflows
       Number of Taverna 1 workflows from 2007 to 2011
                 2007  2008  2009  2010  2011
         Tested    12    10    10    10    4*
         Total     74   341   101    26    13
       Number of Taverna 2 workflows from 2009 to 2012
                 2009  2010  2011  2012
         Tested    12    10    15     9
         Total     97   308   289   184
  • 14. Profile of Analyzed Workflows
  • 15. The Proportion of Decay: 75% of the 92 tested workflows failed to either execute or produce the same result (if testable). Those from the early years (2007-2009) had a 91% failure rate. [Chart series: Taverna 1, Taverna 2]
  • 16. The Cause of Decay
       Manual analysis: by the validation report from the Taverna workbench; by interpreting experiment results reported by Taverna
       Identified 4 categories of causes: missing example data; missing execution environment; insufficient descriptions about workflows; volatile third-party resources
       Other possible factors not considered: changes in the local operating environment (hardware, OS, middleware, compiler, etc.)
  • 17. Decay Caused by Third-Party Resources (cause, refined cause, example)
       • Third-party resources are not available
         – Underlying dataset, particularly a locally hosted in-house dataset, is no longer available. Example: the researcher hosting the data changed institution and the server is no longer available.
         – Services are deprecated. Example: DDBJ web services are no longer provided, despite being used in many myExperiment workflows.
       • Third-party resources are available but not accessible
         – Data is available but identified using different IDs than the ones known to the user. Example: for scalability reasons the input data is superseded by new data, making the workflow not executable or causing it to provide wrong results.
         – Data is available, but permission, a certificate, or network access is needed to reach it. Example: cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider.
         – Services are available but need permission, a certificate, or network access to be invoked. Example: the security policies of the execution framework are updated due to new hosting institution rules.
       • Third-party resources have changed
         – Services are still available under the same identifiers, but their functionality has changed. Example: the web services are updated.
  • 19. Summary of Decay Causes
       • 50% of the decay was caused by volatility of third-party resources: unavailable, inaccessible, or updated
       • Missing example data: unable to re-run
       • Missing execution environment, such as local plugins
       • Insufficient metadata, such as any required dependency libraries or permission information
  • 20. Storyline ✓ The importance of reproducibility ✓ Workflow as a means for preserving scientific methods ✓ Understanding the causes of workflow decay • Combating decay • Lessons learnt and future work
  • 21. Combating Workflow Decay
       • Objective: to provide enough information to
         – Prevent decay
         – Detect decay
         – Repair decay
       • Approach: Research Objects + Checklists (Wf4Ever Project)
         – Research Objects [1][2]: aggregate workflow specifications together with auxiliary elements, such as example data inputs, annotations, and provenance traces, that can be used to prevent decay and/or repair the workflow in case of decay.
         – Checklists: to check that sufficient information is preserved along with the workflows
       [1] http://wf4ever.github.com/ro/ [2] http://wf4ever.github.com/ro-primer/
  • 22. Checklists • Checklists are a well-established tool for guiding practices to ensure safety, quality and consistency in the conduct of complex operations. • They have been adopted by the biological research community to promote consistency across research datasets. • In our case, we use checklists to assess whether a research object contains sufficient information for running the workflow and checking that its results are replicable.
  • 23. Checklist-ing the Reproducibility of a Workflow. The Minim model used in our approach is an adaptation of the MIM model [1][2]. [1] Matthew Gamble, Jun Zhao, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. eScience 2012 [2] https://raw.github.com/wf4ever/ro-manager/master/src/iaeval/Minim/minim.rdf
  • 24. Use Case
       • 4 myExperiment packs
         – 2 from genomics, 1 from geography, and 1 domain-neutral
       • Experiment process:
         – Transform them into ROs
         – Create checklist descriptions
       • Observations
         – 2 research objects were found not to contain the necessary information to run them; the 2 others failed because of updates to third-party resources and to the execution environment.
  • 25. Storyline ✓ The importance of reproducibility ✓ Workflow as a means for preserving scientific methods ✓ Understanding the causes of workflow decay • Combating decay • Lessons learnt and future work
  • 26. Lessons Learnt 1. Dependency is the root enemy of reproducible workflows 2. Documentation, i.e., annotation, is vital 3. Documentation should be easy to create
  • 27. The Future Work • Decay detection, explanation, and repair • Reproducibility and provenance • Working with scientists is vital for reproducible science – GigaScience – BioVel – 2020 Science
  • 28. Acknowledgement EU Wf4Ever project (270129) funded under EU FP7 (ICT- 2009.4.1). (http://www.wf4ever-project.org) The principles of provenance. Dagstuhl, March 1, 2012
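
The sample sizes on slides 13 and 15 can be cross-checked with a little arithmetic. The sketch below is only an illustration: the numbers are copied from the slide 13 tables, while the variable and function names are made up here and do not come from the authors' tooling.

```python
# Tested / total Taverna workflows per creation year, copied from slide 13.
# Each value is a (tested, total) pair; the names are illustrative only.
taverna1 = {2007: (12, 74), 2008: (10, 341), 2009: (10, 101), 2010: (10, 26), 2011: (4, 13)}
taverna2 = {2009: (12, 97), 2010: (10, 308), 2011: (15, 289), 2012: (9, 184)}


def summarize(table):
    """Return (tested, total) summed over all years of one table."""
    tested = sum(t for t, _ in table.values())
    total = sum(n for _, n in table.values())
    return tested, total


t1_tested, t1_total = summarize(taverna1)
t2_tested, t2_total = summarize(taverna2)
all_tested = t1_tested + t2_tested

print(f"Taverna 1: {t1_tested} of {t1_total} tested")
print(f"Taverna 2: {t2_tested} of {t2_total} tested")
print(f"All tested: {all_tested}")
# Slide 15 reports that 75% of the tested workflows decayed.
print(f"Approx. decayed: {round(0.75 * all_tested)}")
```

The per-table sums come to 46 tested workflows each, which matches the 92 tested workflows quoted on slide 15; 75% of 92 is 69 workflows.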

Editor's notes

  1. So why do we bother about reproducibility? Because the results that scientists reach are not only insights: they are used to ensure progress in practice. Moreover, their results are built upon by other scientists to reach new results. Therefore, testing claimed results is crucial for science to be self-correcting. ============ Important scientific results not only give insight but also lead to practical progress. The ability to test results is crucial for science to be self-correcting. A hallmark of the scientific method is that experiments should be described in enough detail that they can be repeated and perhaps generalized. The idea in natural science is that if a scientist claims an experimental result, then another scientist should be able to check it. Similarly, in a computational environment, it should be possible to repeat a computational experiment as the authors have run it or to change the experiment to see how robust the authors’ conclusions are to changes in parameters or data (a concept called workability).
  2. As an example, in this article a researcher found that many basic studies on cancer are unreliable, with grim consequences for producing new medicines in the future. This supports the need for testing research results. http://www.reuters.com/article/2012/03/28/us-science-cancer-idUSBRE82R12P20120328 During a decade as head of global cancer research at Amgen, C. Glenn Begley identified 53 "landmark" publications -- papers in top journals, from reputable labs -- for his team to reproduce. Begley sought to double-check the findings before trying to build on them for drug development. Result: 47 of the 53 could not be replicated. He described his findings in a commentary piece published on Wednesday in the journal Nature.
  3. In natural sciences, when a scientist claims an experimental result, other scientists should be able to check it. This should also be possible in a computational environment: to test the experiment as the authors have run it, or even to change the experiment to see how robust the authors' conclusions are.
  4. In natural sciences, when a scientist claims an experimental result, other scientists should be able to check it. This should also be possible in a computational environment: to test the experiment as the authors have run it, or even to change the experiment to see how robust the authors' conclusions are. To do so, we need information that documents the experiment and specifies the computational environment in which it was run.
  5. I cite these two quotes to underline the fact that experiment results are not enough and that we need information about the process whereby such results were produced.
  6. This is partly witnessed by existing workflow repositories, notably myExperiment and crowdLab, which provide scientists with the means to store, share, publish and reuse workflows. One of the good features of scientific workflows with respect to reproducibility is that they are repeatable and configurable. Well, in principle. Unfortunately, our experience suggests that workflows are likely to suffer from decay over time, which hinders or reduces the ability to execute them and reproduce the same results.
  7. Analyse a sample of real workflows to determine if they suffer from decay and the reasons that caused their decay
  8. C. Missing execution environment: The execution of a workflow may rely on a particular local execution environment, for example, a local R server or a specific version of workflow execution software. Some of our test workflows exhibit this type of decay. Taverna often provides sufficient information about missing libraries, and sometimes workflow descriptions provide a warning about the requirement for a specific library. This type of decay appears to be fixable by installing the missing software, albeit requiring some effort. D. Insufficient descriptions about workflows: Sometimes a workflow workbench cannot provide sufficient information about what caused the failure of a workflow run. Additional descriptions in the workflow can play an important role in helping users who reuse the workflows to understand the purpose of the workflow and its expected outcomes.
  10. Based on our findings, we bundle workflow specifications together with auxiliary information that mitigates decay, e.g., example data inputs, annotations describing the workflow, and provenance traces. The resulting aggregation is termed a Research Object, an abstraction used for workflow preservation. Research Objects are designed, and tooling is developed around them, in the context of the Wf4Ever project. To verify that a given research object provides the information necessary for preserving a workflow against decay, we adopt a checklist-based approach. Checklists are a well-established tool for guiding practices to ensure safety, quality and consistency in the conduct of complex operations [11]. More recently, they have been adopted by the biological research community to promote consistency across research datasets [33]. In our case, we use them to specify the minimum information needed to prevent workflow decay.
  11. The checklists are expressed using the Minim model
  12. As a proof of concept
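
The checklist idea from slides 21 to 24 can be illustrated with a small sketch. It is not the Minim model or the Wf4Ever tooling: the `ResearchObject` class, its fields, the file name and the rule predicates are all hypothetical simplifications, and only the four decay categories are taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchObject:
    """Hypothetical, simplified stand-in for a research object (not the Wf4Ever RO model)."""
    workflow: str
    example_inputs: list = field(default_factory=list)
    environment_notes: str = ""
    description: str = ""
    # Maps each third-party resource to whether it was reachable when last checked.
    resources: dict = field(default_factory=dict)


# Each rule pairs a decay category from slide 16 with a predicate that must hold.
CHECKLIST = [
    ("Missing example data", lambda ro: len(ro.example_inputs) > 0),
    ("Missing execution environment", lambda ro: bool(ro.environment_notes.strip())),
    ("Insufficient descriptions about workflows", lambda ro: bool(ro.description.strip())),
    ("Volatile third-party resources", lambda ro: all(ro.resources.values())),
]


def evaluate(ro):
    """Return the decay categories whose requirement the research object fails."""
    return [label for label, ok in CHECKLIST if not ok(ro)]


# Illustrative research object: no environment notes, one unreachable service.
ro = ResearchObject(
    workflow="pathways.t2flow",  # hypothetical file name
    example_inputs=["chromosome17"],
    description="Detect common pathways",
    resources={"KEGG pathway query": True, "DDBJ web service": False},
)
failures = evaluate(ro)
print(failures)
```

For this example the evaluation flags "Missing execution environment" and "Volatile third-party resources". A real checklist would be expressed declaratively, as the slides do with Minim, rather than as hard-coded Python predicates.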