Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
From Scientific Workflow Patterns
to 5-star Linked Open Data
Alban Gaignard Hala Skaf-Molli Audrey Bihouée
Nantes Academic...
Needs for linked experiment reports
2
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align mu...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align mu...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align mu...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align mu...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing experiment results
Scientific experiment: RNA sequen...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing experiment results
Scientific experiment: RNA sequen...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Expected result: human+machine tractable reports
9
Annotated “Material & ...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
5-star Linked Open Data
10
W3C standards for machine and human
readable d...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
5-star Linked Open Data
11
How to ease this process ?
● Workflow engines ...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PROV only
12
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PROV only
13
too fine-grained
no domain concepts
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Provenance as a Linked Experiment Report
14
few + meaningful statements
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Problem statement & objectives
15
Problem statement
Scientific workflows ...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Problem statement & objectives
16
Problem statement
Scientific workflows ...
Rules generation
17
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Approach
18
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Input domain-specific annotations (❶,❷)
Workflow patterns ❶
19
Sequence p...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Input domain-specific annotations (❶,❷)
Workflow patterns ❶
20
Sequence p...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PoeM: generating PrOvEnance Mining rules ❸
21
(SPARQL Property path)
(SPA...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PoeM: sample generated rule ❸
22
<If> part
<Then> part
First experiments &
results
23
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment context
24
SyMeTRIC: systems medicine project (2015-2017, call...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment
25
Material & methods
● Real-life RNA-seq workflow to study 3 ...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment
26
Material & methods
● Real-life RNA-seq workflow to study 3 ...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Conclusion & perspectives
27
Semi-automated approach
(1) PoeM generates s...
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Conclusion & perspectives
28
Future works
(1) WF patterns: split-merge, “...
Questions ?
Demo: http://poem.univ-nantes.fr
Contact: alban.gaignard@univ-nantes.fr
Acknowledgments
BiRD bioinformatics fa...
TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 30
Larger scale experiments (PROV traces)
1232 edges
TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 31
Larger scale experiments (PROV traces)
49 edges
Prochain SlideShare
Chargement dans…5
×

PoemTapp16

190 vues

Publié le

We address the issue of sharing massively produced data by reusing Linked Open Vocabularies. Our in progress work relies on workflow patterns and rules generator. The rules mine generic provenance metadata and produce domain-specific data annotations. We propose a method for populating reusable 5-star biomedical experiment reports repositories with a reduced data curation cost.

Publié dans : Technologie
  • Soyez le premier à commenter

  • Soyez le premier à aimer ceci

PoemTapp16

  1. 1. From Scientific Workflow Patterns to 5-star Linked Open Data Alban Gaignard Hala Skaf-Molli Audrey Bihouée Nantes Academic Hospital (CHU de Nantes), CNRS, France LINA, Nantes University, CNRS, France Institut du Thorax, Nantes University, INSERM, CNRS, France 8th USENIX workshop on Theory and Practice of Provenance (TaPP’16) Washington DC
  2. 2. Needs for linked experiment reports 2
  3. 3. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 3
  4. 4. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 4 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  5. 5. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 5 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  6. 6. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 6 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb Challenges Algorithmic performance, storage, preservation, reuse (limit recompute) & share. 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  7. 7. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing experiment results Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions. 7
  8. 8. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing experiment results Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions. 8 Need for scientific context : metadata
  9. 9. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Expected result: human+machine tractable reports 9 Annotated “Material & Methods” Links to some workflow artifacts (algorithms, data)
  10. 10. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée 5-star Linked Open Data 10 W3C standards for machine and human readable data on the web. ⭑⭑⭑⭑⭑ : time and expertise !
  11. 11. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée 5-star Linked Open Data 11 How to ease this process ? ● Workflow engines → automation ● PROV → workflow runs as linked data W3C standards for machine and human readable data on the web. ⭑⭑⭑⭑⭑ : time and expertise !
  12. 12. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PROV only 12
  13. 13. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PROV only 13 too fine-grained no domain concepts
  14. 14. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Provenance as a Linked Experiment Report 14 few + meaningful statements
  15. 15. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Problem statement & objectives 15 Problem statement Scientific workflows produce massive raw results. Their publication into curated query-able linked data repositories requires lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data ?
  16. 16. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Problem statement & objectives 16 Problem statement Scientific workflows produce massive raw results. Their publication into curated query-able linked data repositories requires lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data ? Objectives (1) Leverage annotated workflow patterns to generate provenance mining rules. (2) Refine provenance traces into linked experiment reports.
  17. 17. Rules generation 17
  18. 18. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Approach 18
  19. 19. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Input domain-specific annotations (❶,❷) Workflow patterns ❶ 19 Sequence patterns, with possibly intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map
  20. 20. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Input domain-specific annotations (❶,❷) Workflow patterns ❶ 20 Sequence patterns, with possibly intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map Experiment report template ❷ Link scientific claims, statements, material and methods ● MicroPublication ontology: Material, Method, Claims ● Experimental factor ontology: Transcriptome, Gene expression ● NCBI taxonomy: Homo Sapiens ● Open Annotation model: hasBody, hasTarget
  21. 21. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PoeM: generating PrOvEnance Mining rules ❸ 21 (SPARQL Property path) (SPARQL Basic graph pattern) (SPARQL Construct query)
  22. 22. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PoeM: sample generated rule ❸ 22 <If> part <Then> part
  23. 23. First experiments & results 23
  24. 24. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment context 24 SyMeTRIC: systems medicine project (2015-2017, call “Connect Talent”), funded by the french region Pays de la Loire.
  25. 25. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment 25 Material & methods ● Real-life RNA-seq workflow to study 3 mice populations ● WF implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API)
  26. 26. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment 26 Material & methods ● Real-life RNA-seq workflow to study 3 mice populations ● WF implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API) Results (for 1 biological sample) ● 60h CPU (12 cores for genome alignment), 21Gb storage ● 3s to export 81 PROV triples from the Galaxy history ● 2s to apply the rule and produce 35 Micropublication triples
  27. 27. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Conclusion & perspectives 27 Semi-automated approach (1) PoeM generates semantic web rules (2) PoeM rules applied on PROV traces to assemble linked experiment reports (MicroPublication) Limitations: - Sequence workflow patterns only - SPARQL property paths with complex WF patterns ? - Syntactic matching between WF patterns and PROV labels Usage scenarios: → Query workflow datasets with domain concepts → Populate RDF repositories with WF results
  28. 28. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Conclusion & perspectives 28 Future works (1) WF patterns: split-merge, “common motifs” (2) Genericity: other domains / other reports (RO, Nanopub.) (3) PROV heterogeneity: multi-systems PROV (4) Evaluation: involving biologists, at larger scale
  29. 29. Questions ? Demo: http://poem.univ-nantes.fr Contact: alban.gaignard@univ-nantes.fr Acknowledgments BiRD bioinformatics facility Connect Talent Call
  30. 30. TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 30 Larger scale experiments (PROV traces) 1232 edges
  31. 31. TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 31 Larger scale experiments (PROV traces) 49 edges

×