The role of annotation in reproducibility (Empirical 2014)
1. The role of annotation in reproducibility
ESWC2014 Empirical workshop
26/05/2014
Contributors: my PhD students Olga Giraldo, Daniel Garijo, and Idafen Santana, and the Wf4Ever team
Oscar Corcho
ocorcho@fi.upm.es
@ocorcho
https://www.slideshare.com/ocorcho
2. Setting the context of this presentation
Our main assumption
“We are not so good at describing our experiments, and this has a negative impact on reproducibility (and understandability, and conservation, and reconstruction)”
• Let’s see if this happens in different areas of scientific research
• In vitro experiments in Plant Biology
• In silico experiments in several domains
• The challenge
• Let’s use annotation as a means to increase reproducibility
• Note: see the last slide on terminology
5. The role of laboratory protocols in Life Sciences
Laboratory
Protocols
http://mibbi.sourceforge.net/about.shtml
Laboratory protocols support the scientific results
6. Laboratory Protocols
• Written in natural language
• Generally, presented in a “recipe” style
• Description of a sequence of operations that include inputs and outputs
• Step-by-step descriptions of procedures
• A protocol is a type of workflow
• They must be described in sufficient and unambiguous detail to enable another agent (human or machine) to replicate the original experiment.
• Specific journals: Biotechniques, CSH protocols, Current protocols, GMR, Jove, Protocol exchange, Plant methods, Plos One, Springer protocols
8. And other useful elements, including ontologies
It maintains checklists that guide how to report an experiment.
It models the design of an investigation, including protocols, instrumentation, materials and data generated.
EXPO: Aims to formalize knowledge about the organization, execution and analysis of scientific experiments.
EXACT: It provides a model for the description of experiment actions.
Minimal information models, check lists, and even ontologies
9. However…
• Ambiguity is the norm
• Let’s make an analysis on protocols written for the plant biology community
Protocol:
• Incubate the centrifuge tubes in a water bath.
• Incubate the samples for 5 min with gentle shaking.
• Rinse DNA briefly in 1-2 ml of wash.
• Incubate at -20°C overnight.
10. Analysis of Laboratory Protocols
Repository           Number of protocols
Biotechniques        8
CSH protocols        11
Current protocols    25
GMR                  4
Jove                 21
Protocol exchange    12
Plant methods        10
Plos One             3
Springer protocols   5
Total                99
11. Minimal Information to Report a Laboratory Protocol
Our model    Occurrence in other models
TITLE 100%
AUTHOR 100%
INTRODUCTION
Purpose 89%
Provenance of the protocol 89%
Applications of the protocol 89%
Comparison with other protocols 89%
Limitations 89%
MATERIALS
Sample 100%
· strain or line genotype
· Developmental stage
· Organism part (tissue)
Laboratory consumables/supplies
· Laboratory consumable name 22%
· Manufacturer name 11%
· Laboratory consumable ID (catalog number) 11%
Buffer recipes
· Buffer name 67%
· Chemical compound name 67%
· Initial concentration of chemical compound 67%
· Final concentration or amount of chemical compound 56%
· Storage conditions 56%
· Cautions 56%
· Hints 67%
Reagent
· Reagent name 100%
· Reagent vendor or manufacturer 100%
· Reagent ID (catalog number) 100%
Kit
· Kit name 100%
· Kit vendor or manufacturer 100%
· Kit ID (catalog number) 56%
Primer
· Primer name 67%
· Primer sequence 89%
· Primer vendor or manufacturer 33%
Equipment
· Equipment name 67%
· Equipment vendor or Manufacturer 67%
· Equipment ID (catalog number) 67%
Software
· Software name 67%
· Software version 67%
METHODS/PROCEDURE
Protocol 100%
· Cautions 56%
· Critical steps 56%
· Pause point 33%
· Hints 22%
· Troubleshooting 44%
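A checklist like this can be applied mechanically once a protocol is stored as structured data. The following is a minimal sketch, assuming a protocol is held as a nested Python dictionary; the field paths in `CHECKLIST` are hypothetical names that loosely follow a few rows of the table, not an official schema.

```python
# Hypothetical checklist entries, written as dotted paths into a nested dict.
CHECKLIST = [
    "title",
    "author",
    "introduction.purpose",
    "materials.sample.organism_part",
    "materials.sample.developmental_stage",
    "methods.protocol",
]


def missing_fields(protocol: dict, checklist=CHECKLIST) -> list[str]:
    """Return the checklist entries that are absent or empty in `protocol`."""
    missing = []
    for path in checklist:
        node = protocol
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = None
                break
        if node in (None, "", [], {}):
            missing.append(path)
    return missing


# Illustrative record (invented for this sketch, not taken from the analysed corpus).
example = {
    "title": "DNA extraction from leaves",
    "author": "A. Researcher",
    "materials": {"sample": {"organism_part": "rosette leaves"}},
    "methods": {"protocol": ["Incubate at -20C overnight."]},
}

print(missing_fields(example))
```

Run over a collection of protocols, a report like this would give per-field coverage figures of the same kind as the percentages in the table.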
12. How to Formalize the Protocols?
Protocol:
• Incubate the centrifuge tubes at 65°C in a water bath for 10 min.
Action: incubate.
Object: centrifuge tubes, water bath.
Unit of measure: 65°C, 10 min.
• Rinse DNA briefly in 1-2 ml of wash.
“Briefly” could indicate different lengths of time: 2 seconds? 5-10 seconds?
• Incubate at -20°C overnight.
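The decomposition into action, objects and measured quantities can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names and the `REQUIRED` table (which parameters each action needs) are hypothetical and are not part of the ontological model being designed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Quantity:
    value: float
    unit: str
    upper: Optional[float] = None  # set when the text gives a range, e.g. 1-2 ml


# Hypothetical rule table: parameters a step of each action should state.
REQUIRED = {
    "incubate": ("temperature", "duration"),
    "rinse": ("volume", "duration"),
}


@dataclass
class Step:
    action: str
    objects: list
    parameters: dict = field(default_factory=dict)

    def unspecified(self) -> list:
        """Required parameters that the instruction leaves open."""
        return [p for p in REQUIRED.get(self.action, ()) if p not in self.parameters]


incubate = Step(
    "incubate",
    ["centrifuge tubes", "water bath"],
    {"temperature": Quantity(65, "°C"), "duration": Quantity(10, "min")},
)
# "briefly" carries no usable duration, so it is left out of the parameters.
rinse = Step("rinse", ["DNA", "wash"], {"volume": Quantity(1, "ml", upper=2)})
# "overnight" is likewise not a measured duration.
freeze = Step("incubate", ["samples"], {"temperature": Quantity(-20, "°C")})

for step in (incubate, rinse, freeze):
    print(step.action, step.unspecified())
```

Making the open parameters explicit is the point of the exercise: the ambiguity of words such as "briefly" or "overnight" becomes a visible gap instead of something each reader resolves differently.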
14. Currently working on protocol annotation
Source: Biotechniques
Meta-information about content: Content
Plant material: Arabidopsis thaliana (rosette leaves, flowers, siliques),… and Larix decidua (young needles)
Instrument name: Leitz DMRB microscope
Manufacturer: Leica Micro-systems
Buffer recipe: 50 mM EDTA, 1.4% SDS
Reagent name: 96% ethanol ~ absolute ethanol
Laboratory consumable name: 2-mL tube, zeolite beads
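To give a feel for what such annotation produces, here is a minimal sketch of a dictionary-based annotator. It assumes a tiny hand-made lexicon built from the example values above, and the input sentence is invented for illustration; this is not the annotation tooling used in the work presented here.

```python
import re

# Hypothetical lexicon: surface form -> meta-information category.
LEXICON = {
    "rosette leaves": "Plant material",
    "Leitz DMRB microscope": "Instrument name",
    "96% ethanol": "Reagent name",
    "2-mL tube": "Laboratory consumable name",
    "zeolite beads": "Laboratory consumable name",
}


def annotate(text: str) -> list:
    """Return (start, end, category, matched text) tuples ordered by position."""
    spans = []
    for term, category in LEXICON.items():
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), category, match.group(0)))
    return sorted(spans)


# Invented example sentence, not quoted from any protocol.
sentence = "Grind rosette leaves in a 2-mL tube with zeolite beads, then add 96% ethanol."

for start, end, category, surface in annotate(sentence):
    print(f"{category}: {surface} [{start}:{end}]")
```

A plain lexicon lookup cannot tell that "96% ethanol" and "absolute ethanol" are being treated as equivalent, which is one reason to back the annotations with an ontology.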
15. From the wet lab to our computers
[Diagram labels: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment]
17. Scientific Workflows
“Template defining the set of tasks needed
to carry out a computational experiment”
[1]
•Inputs
•Steps
•Intermediate results
•Outputs
•Data driven, usually represented as Directed Acyclic Graphs (DAGs)
[1] Ewa Deelman, Dennis Gannon, Matthew Shields, Ian Taylor, Workflows and e-science: an overview of
workflow system features and capabilities, Future Generation Computer Systems 25 (5) (2009) 528–540.
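As a minimal sketch of the DAG view of a workflow, the snippet below encodes a template as a mapping from each step to the steps whose outputs it consumes, and derives a valid execution order with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The step names are hypothetical and do not come from any specific workflow system.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical names).
workflow = {
    "load_input": set(),
    "clean": {"load_input"},
    "analyse": {"clean"},
    "plot": {"analyse"},
    "report": {"analyse", "plot"},
}

# static_order() yields every step after all of its dependencies;
# it raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Inputs are the steps with no dependencies, outputs are the steps nothing else depends on, and everything in between produces intermediate results.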
19. What do I want from these workflows and repositories?
• As a designer: Discovery
•Workflows with similar functionality fragments/methods
•Design based on previous templates.
• As user/reuser/reviewer: Understandability, Exploration
•Search workflows by functionality
•Commonalities between execution runs
•Component categorization
•Reproducibility
20. Working on different aspects of workflow preservation
•Workflow representation
•Plan/template representation
•Provenance trace representation
•Link between templates and traces
•Creation of abstractions/motifs in scientific workflows
•Abstraction catalog
•Find how different workflows are related
•Understandability and reuse of scientific workflows
•Relation between the workflows involved in the same experiment (Research Objects)
CH1: Can we export an abstract template of the method being represented?
CH2: How do we interoperate with other workflow results?
CH3: How do we access the workflow results?
CH4: How do we link an abstract method with several implementations?
CH5: How can we detect what are the typical operations in scientific workflows?
CH6: How can we detect them automatically?
CH7: Which workflow parts are related to other workflows?
CH8: How do workflows depend on the other parts of the experiments?
21. Overview
• Empirical analysis on 260 workflow templates from Taverna, Wings, Galaxy and Vistrails
• Catalog of recurring patterns: scientific workflow motifs.
• Data Oriented Motifs
• Workflow Oriented Motifs
•Understandability and reuse
Common motifs in scientific workflows: An empirical analysis. Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013.
22. Approach
•Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence
•Identify workflow abstractions that would facilitate understandability and therefore effective re-use
23. Motif Catalog
Data-Oriented Motifs (What?)
Data Retrieval
Data Preparation
Format Transformation
Input Augmentation and Output Splitting
Data Organisation
Data Analysis
Data Curation/Cleaning
Data Moving
Data Visualisation
Workflow-Oriented Motifs (How?)
Intra-Workflow Motifs
Stateful (Asynchronous) Invocations
Stateless (Synchronous) Invocations
Internal Macros
Human Interactions
Inter-Workflow Motifs
Atomic Workflows
Composite Workflows
Workflow Overloading
Ontology Purl: http://purl.org/net/wf-motifs
24. Macro abstraction detection
Problem statement:
Given a repository of workflow templates (either abstract or specific) or
workflow execution traces, what are the workflow fragments I can deduce
from it?
Useful for:
•Systems like Taverna and Wings: (Many templates, little annotation to relate them)
•Finding relationships between workflows and sub-workflows.
•Most used fragments, most executed, etc.
•Systems like GenePattern, LONI Pipeline and Galaxy: (Many runs, nearly no templates published)
•Proposing new templates with the popular fragments.
25. Common workflow fragment detection
[Holder et al 1994]: Substructure Discovery in the SUBDUE System. L. B. Holder, D. J. Cook, and S. Djoko. AAAI Workshop on Knowledge Discovery, pages 169-180, 1994.
•Given a collection of workflows, which are the most common fragments?
•Common sub-graphs among the collection
•Sub-graph isomorphism (NP-complete)
•We use subgraph mining algorithms
•Graph Grammar learning
•The rules of the grammar are the workflow fragments
•Graph based hierarchical clustering
•Each cluster corresponds to a workflow fragment
•Iterative algorithm with two measures for compressing the graph:
•Minimum Description Length (MDL)
•Size
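The idea of "most common fragments" can be illustrated with a deliberately simplified stand-in. The sketch below is not SUBDUE and uses neither MDL nor general subgraph matching: it only enumerates short chains of step types in each workflow and counts in how many workflows each chain appears. The toy workflows and step names are hypothetical.

```python
from collections import Counter

# Toy workflows as lists of (producer step type, consumer step type) edges.
workflows = {
    "wf1": [("retrieve", "clean"), ("clean", "analyse"), ("analyse", "plot")],
    "wf2": [("retrieve", "clean"), ("clean", "analyse"), ("analyse", "export")],
    "wf3": [("retrieve", "transform"), ("transform", "analyse"), ("analyse", "plot")],
}


def chains(edges, length):
    """Enumerate simple paths with `length` edges as tuples of step types."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
    paths = {(source, target) for source, target in edges}
    for _ in range(length - 1):
        paths = {
            path + (nxt,)
            for path in paths
            for nxt in successors.get(path[-1], [])
            if nxt not in path  # keep paths simple (no repeated step type)
        }
    return paths


def common_fragments(workflows, max_len=2, min_support=2):
    """Chains of up to `max_len` edges found in at least `min_support` workflows."""
    support = Counter()
    for edges in workflows.values():
        for length in range(1, max_len + 1):
            # chains() returns a set, so a workflow counts once per fragment.
            support.update(chains(edges, length))
    return {frag: n for frag, n in support.items() if n >= min_support}


fragments = common_fragments(workflows)
for fragment, count in sorted(fragments.items()):
    print(" -> ".join(fragment), count)
```

Exhaustive enumeration like this grows quickly with fragment size, which is why real systems rely on heuristics such as the compression measures listed above.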
31. What is a Research Object?
•Aggregation of resources that bundles together the contents of a research work:
•Data
•Experiments
•Examples
•Bibliography
•Annotations
•Provenance
•ROs
•Etc.
http://www.researchobject.org/
Workflow-Centric Research Objects: First Class Citizens in Scholarly Discourse. Belhajjame, K.; Corcho, O.; Garijo, D.; Zhao, J.; Missier, P.; Newman, D.; Palma, R.; Bechhofer, S.; García, E.; Gómez-Pérez, J. M.; Klyne, G.; Page, K.; Roos, M.; Ruiz, J. E.; Soiland-Reyes, S.; Verdes-Montenegro, L.; De Roure, D.; and Goble, C. In Proceedings of the Second International Conference on the Future of Scholarly Communication and Scientific Publishing (Sepublica2012), pages 1-12, Hersonissos, 2012.
36. A final note on terminology
Source: Idafen Santana; Inspired by [Goble, 2012]
Editor's notes
However, when the protocols are published, some of them present problems such as insufficient granularity, and the instructions can be imprecise or ambiguous because they are written in natural language. To avoid arbitrary interpretations, we are designing an ontological structure that facilitates the formal representation of experimental protocols.
This is the What vs How vs Why.
Why test more than 1 subgraph algorithm? Because we want to compare the fragments obtained from the algorithms.
There is always a trade-off between the size of the subgraph and its frequency.