SlideShare une entreprise Scribd logo
1  sur  31
Télécharger pour lire hors ligne
From Scientific Workflow Patterns
to 5-star Linked Open Data
Alban Gaignard Hala Skaf-Molli Audrey Bihouée
Nantes Academic
Hospital (CHU de
Nantes), CNRS, France
LINA,
Nantes University,
CNRS,
France
Institut du Thorax,
Nantes University,
INSERM, CNRS,
France
8th USENIX workshop on Theory and Practice of Provenance (TaPP’16)
Washington DC
Needs for linked experiment reports
2
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
3
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
4
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
5
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
6
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
Challenges
Algorithmic performance, storage, preservation,
reuse (limit recompute) & share.
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing experiment results
Scientific experiment: RNA sequencing to quantify gene expression
levels under multiple biological conditions.
7
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing experiment results
Scientific experiment: RNA sequencing to quantify gene expression
levels under multiple biological conditions.
8
Need for scientific context : metadata
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Expected result: human+machine tractable reports
9
Annotated “Material & Methods”
Links to some workflow artifacts (algorithms, data)
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
5-star Linked Open Data
10
W3C standards for machine and human
readable data on the web.
⭑⭑⭑⭑⭑ : time and expertise !
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
5-star Linked Open Data
11
How to ease this process ?
● Workflow engines → automation
● PROV → workflow runs as linked data
W3C standards for machine and human
readable data on the web.
⭑⭑⭑⭑⭑ : time and expertise !
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PROV only
12
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PROV only
13
too fine-grained
no domain concepts
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Provenance as a Linked Experiment Report
14
few + meaningful statements
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Problem statement & objectives
15
Problem statement
Scientific workflows produce massive raw results. Their
publication into curated query-able linked data repositories
requires lot of time and expertise.
Can we exploit provenance traces to ease the publication of
scientific results as Linked Data ?
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Problem statement & objectives
16
Problem statement
Scientific workflows produce massive raw results. Their
publication into curated query-able linked data repositories
requires lot of time and expertise.
Can we exploit provenance traces to ease the publication of
scientific results as Linked Data ?
Objectives
(1) Leverage annotated workflow patterns to generate
provenance mining rules.
(2) Refine provenance traces into linked experiment reports.
Rules generation
17
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Approach
18
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Input domain-specific annotations (❶,❷)
Workflow patterns ❶
19
Sequence patterns, with possibly intermediate steps
● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar
● EDAM ontology: hasFunction, RNA sequence, Genome map
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Input domain-specific annotations (❶,❷)
Workflow patterns ❶
20
Sequence patterns, with possibly intermediate steps
● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar
● EDAM ontology: hasFunction, RNA sequence, Genome map
Experiment report template ❷
Link scientific claims, statements, material and methods
● MicroPublication ontology: Material, Method, Claims
● Experimental factor ontology: Transcriptome, Gene expression
● NCBI taxonomy: Homo Sapiens
● Open Annotation model: hasBody, hasTarget
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PoeM: generating PrOvEnance Mining rules ❸
21
(SPARQL Property path)
(SPARQL Basic graph pattern)
(SPARQL Construct query)
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PoeM: sample generated rule ❸
22
<If> part
<Then> part
First experiments &
results
23
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment context
24
SyMeTRIC: systems medicine project (2015-2017, call “Connect
Talent”), funded by the french region Pays de la Loire.
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment
25
Material & methods
● Real-life RNA-seq workflow to study 3 mice populations
● WF implemented in Galaxy, run on 2 biological samples
● PROV traces exported from Galaxy Histories (API)
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment
26
Material & methods
● Real-life RNA-seq workflow to study 3 mice populations
● WF implemented in Galaxy, run on 2 biological samples
● PROV traces exported from Galaxy Histories (API)
Results (for 1 biological sample)
● 60h CPU (12 cores for genome alignment), 21Gb storage
● 3s to export 81 PROV triples from the Galaxy history
● 2s to apply the rule and produce 35 Micropublication triples
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Conclusion & perspectives
27
Semi-automated approach
(1) PoeM generates semantic web rules
(2) PoeM rules applied on PROV traces to assemble
linked experiment reports (MicroPublication)
Limitations:
- Sequence workflow patterns only
- SPARQL property paths with complex WF patterns ?
- Syntactic matching between WF patterns and PROV labels
Usage scenarios:
→ Query workflow datasets with domain concepts
→ Populate RDF repositories with WF results
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Conclusion & perspectives
28
Future works
(1) WF patterns: split-merge, “common motifs”
(2) Genericity: other domains / other reports (RO, Nanopub.)
(3) PROV heterogeneity: multi-systems PROV
(4) Evaluation: involving biologists, at larger scale
Questions ?
Demo: http://poem.univ-nantes.fr
Contact: alban.gaignard@univ-nantes.fr
Acknowledgments
BiRD bioinformatics facility Connect Talent Call
TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 30
Larger scale experiments (PROV traces)
1232 edges
TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 31
Larger scale experiments (PROV traces)
49 edges

Contenu connexe

Tendances

Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsGenome Reference Consortium
 
Mapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesMapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesYasset Perez-Riverol
 
Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Yasset Perez-Riverol
 
Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录insight-labs
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog libraryNGDATA
 
New methods hackathon mhc project
New methods   hackathon mhc projectNew methods   hackathon mhc project
New methods hackathon mhc projectGenomeInABottle
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysisYun Lung Li
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Genome Reference Consortium
 
Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Elsa von Licy
 
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...GENALICE
 
Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...GENALICE
 

Tendances (20)

Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
Mapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesMapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome Coordinates
 
Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...
 
Ashg grc workshop2015_tg
Ashg grc workshop2015_tgAshg grc workshop2015_tg
Ashg grc workshop2015_tg
 
agbt 2016 workshop lindsay
agbt 2016 workshop lindsayagbt 2016 workshop lindsay
agbt 2016 workshop lindsay
 
Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog library
 
New methods hackathon mhc project
New methods   hackathon mhc projectNew methods   hackathon mhc project
New methods hackathon mhc project
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysis
 
Sept2016 sv nabsys
Sept2016 sv nabsysSept2016 sv nabsys
Sept2016 sv nabsys
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
 
AGBT 2016 Workshop Magrini
AGBT 2016 Workshop MagriniAGBT 2016 Workshop Magrini
AGBT 2016 Workshop Magrini
 
Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013
 
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
 
Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
 

Similaire à PoemTapp16

The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceRobert Grossman
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...Baptiste Mayjonade
 
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Alejandra Gonzalez-Beltran
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009Ian Foster
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartAraport
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128GenomeInABottle
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizeAnn Loraine
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposalGenomeInABottle
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsJuan Antonio Vizcaino
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’ Paolo Missier
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...GenomeInABottle
 
2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngsDin Apellidos
 
Bic bellec 2016_small
Bic bellec 2016_smallBic bellec 2016_small
Bic bellec 2016_smallPierre Bellec
 
SHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceSHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceGaignard Alban
 
SHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSyed Muhammad Ali Hasnain
 
RAMP Data Challenge
RAMP Data Challenge RAMP Data Challenge
RAMP Data Challenge Proto204
 

Similaire à PoemTapp16 (20)

The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
 
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
 
Thesis biobix
Thesis biobixThesis biobix
Thesis biobix
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick Provart
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualize
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasets
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
 
ISMB Workshop 2014
ISMB Workshop 2014ISMB Workshop 2014
ISMB Workshop 2014
 
2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs
 
Bic bellec 2016_small
Bic bellec 2016_smallBic bellec 2016_small
Bic bellec 2016_small
 
SHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceSHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenance
 
SHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenance
 
BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
RAMP Data Challenge
RAMP Data Challenge RAMP Data Challenge
RAMP Data Challenge
 

Dernier

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Dernier (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

PoemTapp16

  • 1. From Scientific Workflow Patterns to 5-star Linked Open Data Alban Gaignard Hala Skaf-Molli Audrey Bihouée Nantes Academic Hospital (CHU de Nantes), CNRS, France LINA, Nantes University, CNRS, France Institut du Thorax, Nantes University, INSERM, CNRS, France 8th USENIX workshop on Theory and Practice of Provenance (TaPP’16) Washington DC
  • 2. Needs for linked experiment reports 2
  • 3. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 3
  • 4. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 4 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 5. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 5 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 6. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 6 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb Challenges Algorithmic performance, storage, preservation, reuse (limit recompute) & share. 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 7. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing experiment results Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions. 7
  • 8. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing experiment results Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions. 8 Need for scientific context : metadata
  • 9. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Expected result: human+machine tractable reports 9 Annotated “Material & Methods” Links to some workflow artifacts (algorithms, data)
  • 10. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée 5-star Linked Open Data 10 W3C standards for machine and human readable data on the web. ⭑⭑⭑⭑⭑ : time and expertise !
  • 11. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée 5-star Linked Open Data 11 How to ease this process ? ● Workflow engines → automation ● PROV → workflow runs as linked data W3C standards for machine and human readable data on the web. ⭑⭑⭑⭑⭑ : time and expertise !
  • 12. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PROV only 12
  • 13. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PROV only 13 too fine-grained no domain concepts
  • 14. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Provenance as a Linked Experiment Report 14 few + meaningful statements
  • 15. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Problem statement & objectives 15 Problem statement Scientific workflows produce massive raw results. Their publication into curated query-able linked data repositories requires lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data ?
  • 16. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Problem statement & objectives 16 Problem statement Scientific workflows produce massive raw results. Their publication into curated query-able linked data repositories requires lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data ? Objectives (1) Leverage annotated workflow patterns to generate provenance mining rules. (2) Refine provenance traces into linked experiment reports.
  • 18. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Approach 18
  • 19. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Input domain-specific annotations (❶,❷) Workflow patterns ❶ 19 Sequence patterns, with possibly intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map
  • 20. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Input domain-specific annotations (❶,❷) Workflow patterns ❶ 20 Sequence patterns, with possibly intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map Experiment report template ❷ Link scientific claims, statements, material and methods ● MicroPublication ontology: Material, Method, Claims ● Experimental factor ontology: Transcriptome, Gene expression ● NCBI taxonomy: Homo Sapiens ● Open Annotation model: hasBody, hasTarget
  • 21. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PoeM: generating PrOvEnance Mining rules ❸ 21 (SPARQL Property path) (SPARQL Basic graph pattern) (SPARQL Construct query)
  • 22. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PoeM: sample generated rule ❸ 22 <If> part <Then> part
  • 24. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment context 24 SyMeTRIC: systems medicine project (2015-2017, call “Connect Talent”), funded by the french region Pays de la Loire.
  • 25. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment 25 Material & methods ● Real-life RNA-seq workflow to study 3 mice populations ● WF implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API)
  • 26. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment 26 Material & methods ● Real-life RNA-seq workflow to study 3 mice populations ● WF implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API) Results (for 1 biological sample) ● 60h CPU (12 cores for genome alignment), 21Gb storage ● 3s to export 81 PROV triples from the Galaxy history ● 2s to apply the rule and produce 35 Micropublication triples
  • 27. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Conclusion & perspectives 27 Semi-automated approach (1) PoeM generates semantic web rules (2) PoeM rules applied on PROV traces to assemble linked experiment reports (MicroPublication) Limitations: - Sequence workflow patterns only - SPARQL property paths with complex WF patterns ? - Syntactic matching between WF patterns and PROV labels Usage scenarios: → Query workflow datasets with domain concepts → Populate RDF repositories with WF results
  • 28. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Conclusion & perspectives 28 Future works (1) WF patterns: split-merge, “common motifs” (2) Genericity: other domains / other reports (RO, Nanopub.) (3) PROV heterogeneity: multi-systems PROV (4) Evaluation: involving biologists, at larger scale
  • 29. Questions ? Demo: http://poem.univ-nantes.fr Contact: alban.gaignard@univ-nantes.fr Acknowledgments BiRD bioinformatics facility Connect Talent Call
  • 30. TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 30 Larger scale experiments (PROV traces) 1232 edges
  • 31. TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 31 Larger scale experiments (PROV traces) 49 edges