SlideShare a Scribd company logo
1 of 49
Download to read offline
Sharing massive data analysis : 

from provenance to linked
experiment reports
Scientific workflows, provenance and linked data to the rescue
Alban Gaignard, PhD, CNRS
Toulouse, 13 november 2018
APSEM 2018
Reproducibility
at the age of massive data
Analysis ?
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical?. PLOS Biology 13(7): e1002195. https://doi.org/10.1371/journal.pbio.1002195
Massive life science data production
Knowledge production
« In 2012, Amgen researchers made
headlines when they declared that they
had been unable to reproduce the
findings  in 47 of 53 'landmark' cancer
papers » (doi:10.1038/nature.2016.19269)
Repeat < Replicate < Reproduce < Reuse
Same experiment
Same setup
Same lab
Same experiment
Same setup
Same lab
Same experiment
Same setup
Same lab
New ideas, 

new experiment,
some comonalities
S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal, C. Blanchet,
Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems, Volume 75, 2017,
https://doi.org/10.1016/j.future.2017.01.012 .
Intracranial aneurysms :
localized dilation or
ballooning in cerebral
blood vessels
Objective, systematic,
reapatable measures ?
Scientific workflows to
the rescue …
What is a workflow ?
Malcolm Atkinson, Sandra Gesing, Johan Montagnat, Ian Taylor. Scientific workflows: Past, present and future. Future Generation Computer Systems, Elsevier, 2017,
75, pp.216 - 227. <10.1016/j.future.2017.05.041>
« Workflows provide a systematic way of describing the
methods needed and provide the interface between 

domain specialists and computing infrastructures. » 

« Workflow management systems (WMS) perform the
complex analyses on a variety of distributed resources »
Scientific workflows to enhance trust in
scientific results :
➞ automate data analysis (at scale)
➞ abstraction (describe/share
methods)
➞ provenance (~transparency)
A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin
#awesome-list #workflow
CONTRIBUTING.md Added contributing 4 years ago
README.md Update README.md 25 days ago
README.md
Awesome Pipeline
A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin
Pipeline frameworks & libraries
ActionChain - A workflow system for simple linear success/failure workflows.
Adage - Small package to describe workflows that are not completely known at definition time.
Airflow - Python-based workflow system created by AirBnb.
Anduril - Component-based workflow framework for scientific data analysis.
Antha - High-level language for biology.
Bds - Scripting language for data pipelines.
BioMake - GNU-Make-like utility for managing builds and complex workflows.
BioQueue - Explicit framework with web monitoring and resource estimation.
Bistro - Library to build and execute typed scientific workflows.
Bpipe - Tool for running and managing bioinformatics pipelines.
Briefly - Python Meta-programming Library for Job Flow Control.
Cluster Flow - Command-line tool which uses common cluster managers to run bioinformatics pipelines.
Clusterjob - Automated reproducibility, and hassle-free submission of computational jobs to clusters.
Compss - Programming model for distributed infrastructures.
Conan2 - Light-weight workflow management application.
Consecution - A Python pipeline abstraction inspired by Apache Storm topologies.
Cosmos - Python library for massively parallel workflows.
Cromwell - Workflow Management System geared towards scientific workflows from the Broad Institute.
Cuneiform - Advanced functional workflow language and framework, implemented in Erlang.
Dagobah - Simple DAG-based job scheduler in Python.
Dagr - A scala based DSL and framework for writing and executing bioinformatics pipelines as Directed Acyclic GRaphs.
Dask - Dask is a flexible parallel computing library for analytics.
Dockerflow - Workflow runner that uses Dataflow to run a series of tasks in Docker.
Doit - Task management & automation tool.
Drake - Robust DSL akin to Make, implemented in Clojure.
Drake R package - Reproducibility and high-performance computing with an easy R-focused interface. Unrelated to
Factual's Drake.
Dray - An engine for managing the execution of container-based workflows.
Fission Workflows - A fast, lightweight workflow engine for serverless/FaaS functions.
pditommaso / awesome-pipeline
228 commits 1 branch 0 releases 42 contributors
masterBranch: New pull request Create new file Upload files Find file Clone or download
Update README.md Latest commit 07ee96a 25 days agopditommaso
Provenance : a way to reuse
produced & analysed data
Definition: Oxford dictionnary
« The beginning of something's existence; something's origin. » 
Definition: Computer Science
« Provenance information describes the origins and the history of data in
its life cycle. » 
« Today, data is often made available on the Internet with no centralized
control over its integrity: data is constantly being created, copied,
moved around, and combined indiscriminately. Because information
sources (or different parts of a single large source) may vary widely in terms
of quality, it is essential to provide provenance and other context
information which can help end users judge whether query results are
trustworthy. »
James Cheney, Laura Chiticariu, and Wang-Chiew Tan. 2009. Provenance in Databases: Why, How, and Where. Found. Trends databases 1, 4 (April 2009), 379-474. DOI=http://
dx.doi.org/10.1561/1900000006
Representing provenance
Learning
model
Unknown
dataset
Predictions
Training
dataset
Features
Feature extraction Learning Task Prediction Task
Learning
model
Unknown
dataset
Predictions
run #2 run #3
wasGeneratedBy
wasGeneratedBy
used
used
used
Training
dataset
run #1
Features
used
wasGeneratedBy
Feature extraction Learning Task Prediction Task
Feature extraction
software
Learning
model
Unknown
dataset
Predictions
run #2 run #3
a researcher
Random forests
software
wasAttributedTo
wasAttributedTo wasAttributedTo
	wasAttributedTo						
wasGeneratedBy
wasGeneratedBy
used
used
used
Training
dataset
run #1
Features
used
wasGeneratedBy
Feature extraction Learning Task Prediction Task
Reasoning with provenance
Feature extraction
software
a researcher
wasAttributedTo
wasAttributedTo
Training
dataset
run #1
Features
used
wasGeneratedBy
wasDerivedFrom
« IF some data was produced by a tool
based on some input parameters,

THEN this data derives from the input
parameters »
PROV,
how-to record/query ?
Writing PROV statements
22
Writing PROV statements
23
Querying PROV graphs
24
Querying PROV graphs
25
PROV store
26
Still open issues …
Reuse instead of
re-execution ?
RNA-seq data analysis workflow
Compute and
storage intensive
Avoid duplicated
storage / computing
TopHat 1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
Is provenance enough for reuse ?
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
Too fine-grained
No domain concepts
Human & machine-tractable report needed !
Annotated paper’s “Material & Methods” with links to some
workflow artifacts (algorithms, data).
Problem statement & objectives
Objectives
(1)Leverage annotated workflow patterns to generate 

provenance mining rules.
(2)Refine provenance traces into linked experiment reports.
Scientific workflows produce massive raw results. Their
publication into curated query-able linked data repositories
requires lot of time and expertise.
Can we exploit provenance traces to ease the publication
of scientific results as Linked Data ?
Problem statement
Approach
A. Gaignard, H. Skaf-Molli, A. Bihouée: From Scientific Workflow Patterns to 5-star Linked Open Data. 8th USENIX Workshop on the Theory and Practice of Provenance, 

TaPP 2016.
PoeM: generating PrOvEnance Mining rules ❸
SPARQL Property path
SPARQL Basic graph pattern
SPARQL Construct query
Demo
Provenance 

in multi-site studies ?
Multi-site studies ➞ ≠ workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
Scattered provenance capture ?
Provenance issues
« Which alignment algorithm was used when predicting these
effects ? »
« A new version of a reference genome is available, which
genome was used when predicting these phenotypes ? »
Need for an overall tracking of provenance over both
Galaxy and Taverna workflows !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
Provenance « heterogeneity »
How to reconcile these
provenance traces ?
Approach
40
A. Gaignard, K. Belhajjame, H. Skaf-Molli. SHARP: Harmonizing and Bridging Cross-Workflow Provenance. The Semantic Web: ESWC 2017 Satellite Events Portorož,
Slovenia, May 28 – June 1, 2017, Revised Selected Papers, 2017
Results
Reconciled provenance as
an « influence graph »
https://github.com/albangaignard/sharp-prov-toolbox
Linked experiment report
with Nanopublication,
domain-specific concepts
Summary
Take home message & perspectives
• Scientific Workflows ➞ automation, abstraction, provenance
• Standards for provenance representation and reasoning
• Better handle multi-site studies (ESWC’17 satelite event paper)
• Linked experiment reports = contextualized and summarized
provenance (TaPP’16 paper)
• Distributed data analysis ➞ Distributed provenance, reasoning ?
• Learning patterns in provenance graphs ?
• Predicting domain-specific annotation for workflow results ? 

What about trust ?
43
Acknowledgments
Audrey Bihouée, Institut du
Thorax, BiRD Bioinformatics
facility, University of Nantes
Hala Skaf-Molli, LS2N,
University of Nantes
Khalid Belhajjame,
LAMSADE, University of
Paris-Dauphine, PSL
action ReproVirtuFlow
GDR
Backup slides
Research Objects
Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Phillip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble,
Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble (2013) Why Linked Data is Not Enough for Scientists, Future Generation Computer
Systems 29(2), February 2013, Pages 599-611, ISSN 0167-739X, https://doi.org/10.1016/j.future.2011.08.004
Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina Hettne, Raul Palma, Eleni Mina, Oscar Corcho, José Manuel Gómez-Pérez, Sean Bechhofer, Graham Klyne,
Carole Goble (2015) Using a suite of ontologies for preserving workflow-centric research objects, Web Semantics: Science, Services and Agents on the World
Wide Web, https://doi.org/10.1016/j.websem.2015.01.003
schema.org Action
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports

More Related Content

What's hot

Reusable Software and Open Data To Optimize Agriculture
Reusable Software and Open Data To Optimize AgricultureReusable Software and Open Data To Optimize Agriculture
Reusable Software and Open Data To Optimize AgricultureDavid LeBauer
 
FireWorks overview
FireWorks overviewFireWorks overview
FireWorks overviewAnubhav Jain
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Anubhav Jain
 
Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Kim Herzig
 
Software Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciencesSoftware Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciencesdgarijo
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesIan Foster
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shangSAIL_QU
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsAnubhav Jain
 
The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...Anubhav Jain
 
Materials Informatics and Python
Materials Informatics and PythonMaterials Informatics and Python
Materials Informatics and PythonShintaro Fukushima
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Anubhav Jain
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Anubhav Jain
 
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Databricks
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...University of California, San Diego
 
Aspects of Reproducibility in Earth Science
Aspects of Reproducibility in Earth ScienceAspects of Reproducibility in Earth Science
Aspects of Reproducibility in Earth ScienceRaul Palma
 

What's hot (20)

Reusable Software and Open Data To Optimize Agriculture
Reusable Software and Open Data To Optimize AgricultureReusable Software and Open Data To Optimize Agriculture
Reusable Software and Open Data To Optimize Agriculture
 
FireWorks overview
FireWorks overviewFireWorks overview
FireWorks overview
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
 
Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)
 
ICME Workshop Jul 2014 - The Materials Project
ICME Workshop Jul 2014 - The Materials ProjectICME Workshop Jul 2014 - The Materials Project
ICME Workshop Jul 2014 - The Materials Project
 
Software Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciencesSoftware Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciences
 
Reproducibility 1
Reproducibility 1Reproducibility 1
Reproducibility 1
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shang
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...
 
Materials Informatics and Python
Materials Informatics and PythonMaterials Informatics and Python
Materials Informatics and Python
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...
 
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
 
Aspects of Reproducibility in Earth Science
Aspects of Reproducibility in Earth ScienceAspects of Reproducibility in Earth Science
Aspects of Reproducibility in Earth Science
 

Similar to Sharing massive data analysis: from provenance to linked experiment reports

Software Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceSoftware Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceCarole Goble
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceDavid De Roure
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilitiesIan Foster
 
The Research Object Initiative: Frameworks and Use Cases
The Research Object Initiative:Frameworks and Use CasesThe Research Object Initiative:Frameworks and Use Cases
The Research Object Initiative: Frameworks and Use CasesCarole Goble
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010Ian Foster
 
The beauty of workflows and models
The beauty of workflows and modelsThe beauty of workflows and models
The beauty of workflows and modelsmyGrid team
 
Collaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsCollaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsAndrea Wiggins
 
Integrating scientific laboratories into the cloud
Integrating scientific laboratories into the cloudIntegrating scientific laboratories into the cloud
Integrating scientific laboratories into the cloudData Finder
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeCarole Goble
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research ObjectsDavid De Roure
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 

Similar to Sharing massive data analysis: from provenance to linked experiment reports (20)

Software Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceSoftware Sustainability: Better Software Better Science
Software Sustainability: Better Software Better Science
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems Science
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 
The Research Object Initiative: Frameworks and Use Cases
The Research Object Initiative:Frameworks and Use CasesThe Research Object Initiative:Frameworks and Use Cases
The Research Object Initiative: Frameworks and Use Cases
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010
 
The beauty of workflows and models
The beauty of workflows and modelsThe beauty of workflows and models
The beauty of workflows and models
 
Collaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsCollaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna Workflows
 
Integrating scientific laboratories into the cloud
Integrating scientific laboratories into the cloudIntegrating scientific laboratories into the cloud
Integrating scientific laboratories into the cloud
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 

Recently uploaded

Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...Monika Rani
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxMohamedFarag457087
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfrohankumarsinghrore1
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flyPRADYUMMAURYA1
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfSumit Kumar yadav
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curveAreesha Ahmad
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsOrtegaSyrineMay
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxSuji236384
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...Scintica Instrumentation
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspectsmuralinath2
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxDiariAli
 

Recently uploaded (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 

Sharing massive data analysis: from provenance to linked experiment reports

  • 1. Sharing massive data analysis : 
 from provenance to linked experiment reports Scientific workflows, provenance and linked data to the rescue Alban Gaignard, PhD, CNRS Toulouse, 13 november 2018 APSEM 2018
  • 2. Reproducibility at the age of massive data
  • 3. Analysis ? Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, et al. (2015) Big Data: Astronomical or Genomical?. PLOS Biology 13(7): e1002195. https://doi.org/10.1371/journal.pbio.1002195 Massive life science data production
  • 4. Knowledge production « In 2012, Amgen researchers made headlines when they declared that they had been unable to reproduce the findings  in 47 of 53 'landmark' cancer papers » (doi:10.1038/nature.2016.19269)
  • 5. Repeat < Replicate < Reproduce < Reuse Same experiment Same setup Same lab Same experiment Same setup Same lab Same experiment Same setup Same lab New ideas, 
 new experiment, some comonalities S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal, C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems, Volume 75, 2017, https://doi.org/10.1016/j.future.2017.01.012 .
  • 6. Intracranial aneurysms : localized dilation or ballooning in cerebral blood vessels Objective, systematic, reapatable measures ?
  • 8. What is a workflow ? Malcolm Atkinson, Sandra Gesing, Johan Montagnat, Ian Taylor. Scientific workflows: Past, present and future. Future Generation Computer Systems, Elsevier, 2017, 75, pp.216 - 227. <10.1016/j.future.2017.05.041> « Workflows provide a systematic way of describing the methods needed and provide the interface between 
 domain specialists and computing infrastructures. » 
 « Workflow management systems (WMS) perform the complex analyses on a variety of distributed resources »
  • 9. Scientific workflows to enhance trust in scientific results : ➞ automate data analysis (at scale) ➞ abstraction (describe/share methods) ➞ provenance (~transparency) A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin #awesome-list #workflow CONTRIBUTING.md Added contributing 4 years ago README.md Update README.md 25 days ago README.md Awesome Pipeline A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin Pipeline frameworks & libraries ActionChain - A workflow system for simple linear success/failure workflows. Adage - Small package to describe workflows that are not completely known at definition time. Airflow - Python-based workflow system created by AirBnb. Anduril - Component-based workflow framework for scientific data analysis. Antha - High-level language for biology. Bds - Scripting language for data pipelines. BioMake - GNU-Make-like utility for managing builds and complex workflows. BioQueue - Explicit framework with web monitoring and resource estimation. Bistro - Library to build and execute typed scientific workflows. Bpipe - Tool for running and managing bioinformatics pipelines. Briefly - Python Meta-programming Library for Job Flow Control. Cluster Flow - Command-line tool which uses common cluster managers to run bioinformatics pipelines. Clusterjob - Automated reproducibility, and hassle-free submission of computational jobs to clusters. Compss - Programming model for distributed infrastructures. Conan2 - Light-weight workflow management application. Consecution - A Python pipeline abstraction inspired by Apache Storm topologies. Cosmos - Python library for massively parallel workflows. Cromwell - Workflow Management System geared towards scientific workflows from the Broad Institute. Cuneiform - Advanced functional workflow language and framework, implemented in Erlang. Dagobah - Simple DAG-based job scheduler in Python. Dagr - A scala based DSL and framework for writing and executing bioinformatics pipelines as Directed Acyclic GRaphs. Dask - Dask is a flexible parallel computing library for analytics. Dockerflow - Workflow runner that uses Dataflow to run a series of tasks in Docker. Doit - Task management & automation tool. Drake - Robust DSL akin to Make, implemented in Clojure. Drake R package - Reproducibility and high-performance computing with an easy R-focused interface. Unrelated to Factual's Drake. Dray - An engine for managing the execution of container-based workflows. Fission Workflows - A fast, lightweight workflow engine for serverless/FaaS functions. pditommaso / awesome-pipeline 228 commits 1 branch 0 releases 42 contributors masterBranch: New pull request Create new file Upload files Find file Clone or download Update README.md Latest commit 07ee96a 25 days agopditommaso
  • 10. Provenance : a way to reuse produced & analysed data
  • 11. Definition: Oxford dictionnary « The beginning of something's existence; something's origin. »  Definition: Computer Science « Provenance information describes the origins and the history of data in its life cycle. »  « Today, data is often made available on the Internet with no centralized control over its integrity: data is constantly being created, copied, moved around, and combined indiscriminately. Because information sources (or different parts of a single large source) may vary widely in terms of quality, it is essential to provide provenance and other context information which can help end users judge whether query results are trustworthy. » James Cheney, Laura Chiticariu, and Wang-Chiew Tan. 2009. Provenance in Databases: Why, How, and Where. Found. Trends databases 1, 4 (April 2009), 379-474. DOI=http:// dx.doi.org/10.1561/1900000006
  • 13.
  • 15. Learning model Unknown dataset Predictions run #2 run #3 wasGeneratedBy wasGeneratedBy used used used Training dataset run #1 Features used wasGeneratedBy Feature extraction Learning Task Prediction Task
  • 16. Feature extraction software Learning model Unknown dataset Predictions run #2 run #3 a researcher Random forests software wasAttributedTo wasAttributedTo wasAttributedTo wasAttributedTo wasGeneratedBy wasGeneratedBy used used used Training dataset run #1 Features used wasGeneratedBy Feature extraction Learning Task Prediction Task
  • 17.
  • 19.
  • 20. Feature extraction software a researcher wasAttributedTo wasAttributedTo Training dataset run #1 Features used wasGeneratedBy wasDerivedFrom « IF some data was produced by a tool based on some input parameters,
 THEN this data derives from the input parameters »
  • 29. RNA-seq data analysis workflow Compute and storage intensive Avoid duplicated storage / computing TopHat 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb
  • 30. Is provenance enough for reuse ? 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb Too fine-grained No domain concepts
  • 31. Human & machine-tractable report needed ! Annotated paper’s “Material & Methods” with links to some workflow artifacts (algorithms, data).
  • 32. Problem statement & objectives Objectives (1)Leverage annotated workflow patterns to generate 
 provenance mining rules. (2)Refine provenance traces into linked experiment reports. Scientific workflows produce massive raw results. Their publication into curated query-able linked data repositories requires lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data ? Problem statement
  • 33. Approach A. Gaignard, H. Skaf-Molli, A. Bihouée: From Scientific Workflow Patterns to 5-star Linked Open Data. 8th USENIX Workshop on the Theory and Practice of Provenance, 
 TaPP 2016.
  • 34. PoeM: generating PrOvEnance Mining rules ❸ SPARQL Property path SPARQL Basic graph pattern SPARQL Construct query
  • 35. Demo
  • 37. Multi-site studies ➞ ≠ workflow engines ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37 Scattered provenance capture ?
  • 38. Provenance issues « Which alignment algorithm was used when predicting these effects ? » « A new version of a reference genome is available, which genome was used when predicting these phenotypes ? » Need for an overall tracking of provenance over both Galaxy and Taverna workflows ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37
  • 39. Provenance « heterogeneity » How to reconcile these provenance traces ?
  • 40. Approach 40 A. Gaignard, K. Belhajjame, H. Skaf-Molli. SHARP: Harmonizing and Bridging Cross-Workflow Provenance. The Semantic Web: ESWC 2017 Satellite Events Portorož, Slovenia, May 28 – June 1, 2017, Revised Selected Papers, 2017
  • 41. Results Reconciled provenance as an « influence graph » https://github.com/albangaignard/sharp-prov-toolbox Linked experiment report with Nanopublication, domain-specific concepts
  • 43. Take home message & perspectives • Scientific Workflows ➞ automation, abstraction, provenance • Standards for provenance representation and reasoning • Better handle multi-site studies (ESWC’17 satelite event paper) • Linked experiment reports = contextualized and summarized provenance (TaPP’16 paper) • Distributed data analysis ➞ Distributed provenance, reasoning ? • Learning patterns in provenance graphs ? • Predicting domain-specific annotation for workflow results ? 
 What about trust ? 43
  • 44. Acknowledgments Audrey Bihouée, Institut du Thorax, BiRD Bioinformatics facility, University of Nantes Hala Skaf-Molli, LS2N, University of Nantes Khalid Belhajjame, LAMSADE, University of Paris-Dauphine, PSL action ReproVirtuFlow GDR
  • 46. Research Objects Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Phillip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble (2013) Why Linked Data is Not Enough for Scientists, Future Generation Computer Systems 29(2), February 2013, Pages 599-611, ISSN 0167-739X, https://doi.org/10.1016/j.future.2011.08.004 Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina Hettne, Raul Palma, Eleni Mina, Oscar Corcho, José Manuel Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole Goble (2015) Using a suite of ontologies for preserving workflow-centric research objects, Web Semantics: Science, Services and Agents on the World Wide Web, https://doi.org/10.1016/j.websem.2015.01.003