SlideShare une entreprise Scribd logo
1  sur  41
Télécharger pour lire hors ligne
SHARP: Harmonizing cross-workflow provenance
SeWeBMeDA’17
Alban Gaignard1
, Khalid Belhajjame2
, Hala Skaf-Molli3
May 28, 2017
1
Nantes Academic Hospital, France
2
LAMSADE Paris-Dauphine University, France
3
LS2N - Nantes University, France
Multi-scale biological observations ? data silos !
!
!
!
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
Multi-scale biological observations ? data silos !
!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
Multi-scale biological observations ? data silos !
!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data
Workflows
Modeling complex analysis,
running at large scale
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
Multi-scale biological observations ? data silos !
!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data
Workflows
Modeling complex analysis,
running at large scale
Provenance
Informing on data sources,
production and analysis chains
→ transparency, reproducibility
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
go to owl:sameAs
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 2
Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”
“A new version of a reference genome is available, which
genome was used when predicting these phenotypes ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”
“A new version of a reference genome is available, which
genome was used when predicting these phenotypes ?”
Need for an overall tracking of provenance over both Galaxy
and Taverna workflows !
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
PROV-O ontology
https://www.w3.org/TR/prov-o
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 4
Heterogeneity in provenance capture !
Galaxy PROV predicates counts
prov:wasDerivedFrom 118
rdf:type 76
rdfs:label 62
prov:used 61
prov:wasAttributedTo 34
prov:wasGeneratedBy 33
prov:endedAtTime 26
prov:startedAtTime 26
prov:wasAssociatedWith 26
prov:generatedAtTime 1
https://www.w3.org/TR/prov-o
Taverna PROV predicates counts
rdf:type 54
rdfs:label 13
prov:atTime 8
wfprov:describedByParameter 6
rdfs:comment 6
prov:hadRole 6
prov:activity 5
dcterms:hasPart 4
prov:agent 4
prov:endedAtTime 4
prov:hadPlan 4
prov:qualifiedAssociation 4
prov:qualifiedEnd 4
prov:qualifiedStart 4
prov:startedAtTime 4
prov:wasAssociatedWith 4
tavernaprov:content 3
wfprov:usedInput 3
wfprov:wasEnactedBy 3
wfprov:wasOutputFrom 3
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 5
Problem statement
• Heterogeneity in PROV predicates: how to reconcile
different PROV traces ?
• Needs for human-oriented interpretation: tractable size,
domain concepts.
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 6
Approach
1. Applying graph saturation and PROV inferences to
overcome vocabulary heterogeneity.
2. Summarizing harmonized provenance graphs as
life-science experiment reports.
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 7
SHARP: Harmonizing multiple PROV graphs
owl:sameAs
inferred
PROV
PROV
trace
PROV
trace
nanopub
PROV interlinking PROV harmonization PROV summarization
11 12 13
…
14
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 8
Tackling multiple provenance
heterogeneity
– Multi-provenance linking: owl:sameAs
Idea
Stating that two PROV entities associated to the same file
content are the same. go to example
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
– Multi-provenance linking: owl:sameAs
Idea
Stating that two PROV entities associated to the same file
content are the same. go to example
Producing owl:sameAs
1. SHA-512 fingerprint of files
2. annotating PROV entities with the SHA-512 digest
3. producing owl:sameAs → SPARQL CONSTRUCT-WHERE query
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
— Reasoning: PROV inference regime
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 10
— Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
— Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
— Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).
Translation in SPARQL
CONSTRUCT {
?e_2 prov:wasGeneratedBy _:blank_node .
_:blank_node prov:used ?e_1
} WHERE { ?e_2 prov:wasDerivedFrom ?e_1 }
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
— Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).
Translation in SPARQL
CONSTRUCT {
?e_2 prov:wasGeneratedBy _:blank_node .
_:blank_node prov:used ?e_1
} WHERE { ?e_2 prov:wasDerivedFrom ?e_1 }
Issues
• may produce many non-informative blank nodes for
existential variables.
• may increase the size of the PROV graph.
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
˜ Equality-Generating Dependencies (EGDs)
Idea
• Two blank nodes have the same neighborhood → keep
only one BN
• One blank node and a resource have the same
neighborhood → keep only the resource
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
˜ Equality-Generating Dependencies (EGDs)
Idea
• Two blank nodes have the same neighborhood → keep
only one BN
• One blank node and a resource have the same
neighborhood → keep only the resource
Example
prov:qualifiedGeneration
e a
prov:activity
:_gen1
:_gen2
prov:qualifiedGeneration
prov:activity
prov:used a2
prov:qualifiedGeneration
e a
prov:activity
:_gen1
prov:used a2
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
˜ Equality-Generating Dependencies (EGDs)
Input : G : the provenance graph resulting from the application of TGD on G
Output: G : the provenance graph with substituted blank nodes, when possible.
1 begin
2 G ← G
3 substitutions ← new List < Pair < Node, Node >> ()
4 repeat
5 S ← findSubstitutions(G )
6 foreach (s ∈ S) do
7 source ← s[0]
8 target ← s[1]
9 foreach (in ∈ G .listStatements(∗, ∗, source)) do
10 G ← G .add(in.getSubject(), in.getPredicate(), target)
11 G ← G .del(in)
12 foreach (out ∈ G .listStatements(source, ∗, ∗)) do
13 G ← G .add(target, out.getPredicate(), out.getObject())
14 G ← G .del(out)
15 until (S.size() = 0)
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 13
Answering domain-specific
questions
™ From harmonized PROV to Nanopublications
Objectives
Linking data to scientific context and publishing them as
cite-able and exchangeable experiment reports.
→ Semantic science Integrated Ontology + NanoPublications
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
™ From harmonized PROV to Nanopublications
Objectives
Linking data to scientific context and publishing them as
cite-able and exchangeable experiment reports.
→ Semantic science Integrated Ontology + NanoPublications
reference
genome
sample_001
predicted
phenotypes
from exons
sio:has-phenotype
sio:is-variant-of
sio:is-supported-by
sio:is-supported-by
sio:is-supported-by
sio:is-supported-by
scientific
question :head {
ex:pub1 a np:Nanopublication .
ex:pub1 np:hasAssertion :assertion1 ;
np:hasAssertion :assertion2 .
ex:pub1 np:hasProvenance :provenance .
ex:pub1 np:hasPublicationInfo :pubInfo . }
:assertion1 {
ex:question a sio:Question ;
sio:has-value "What are the effects of SNPs
located in exons for study-Y samples" ;
sio:is-supported-by ex:referenceGenome ;
sio:is-supported-by ex:sample_001 ;
sio:is-supported-by ex:annotatedVariants . }
:assertion2 {
ex:referenceGenome a sio:Genome .
ex:sample_001 a sio:Sample ;
sio:is-variant-of ex:referenceGenome ;
sio:has-phenotype ex:annotatedVariants . }
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
™ From harmonized PROV to Nanopublications
How ?
By identifying a data lineage path in multiple PROV graphs,
beforehand harmonized (inferred prov:wasInfluencedBy).
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
™ From harmonized PROV to Nanopublications
How ?
By identifying a data lineage path in multiple PROV graphs,
beforehand harmonized (inferred prov:wasInfluencedBy).
CONSTRUCT {
GRAPH :assertion {
?ref_genome a sio:Genome .
?sample a sio:Sample ;
sio:is-variant-of ?ref_genome ;
sio:has-phenotype ?out .
?out rdfs:label ?out_label .
?out sio:is-supported-by ?ref_genome . }
} WHERE {
?sample rdfs:label ?sample_label.
FILTER (contains(lcase(str(?sample_label)), lcase("fastq"))) .
?ref_genome rdfs:label ?ref_genome_label.
FILTER (contains(lcase(str(?ref_genome_label)), lcase("GRCh"))) .
?out ( prov:wasInfluencedBy )+ ?sample
?out tavernaprov:content ?out_label .
FILTER (contains(lcase(str(?out_label)), lcase("exons"))) .
}
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
Experimental results
Time and space for multiple PROV harmonization ?
Material & methods
• 3 ProvStore graphs
• 10, 109 and 666 edges
• Java + Jena rule engine
• Laptop, 4 cores, 16GB, 5
runs
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
Time and space for multiple PROV harmonization ?
Material & methods
• 3 ProvStore graphs
• 10, 109 and 666 edges
• Java + Jena rule engine
• Laptop, 4 cores, 16GB, 5
runs
Results
• almost 5s for the whole harmonization process
• no new wasDerivedFrom relation (must be captured)
• inferred wasInfluencedBy → “common denominator” for
data lineage
∆ size ∆ BNodes ∆ wDerivedFrom ∆ wInfluencedBy mean time
PA + 2776 + 2 + 1 + 7 4835 ± 343 ms
PB + 3102 + 4 + 1 + 58 4759 ± 71 ms
PC + 5023 + 63 + 1 + 231 5304 ± 176 ms
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
Can SHARP produce human-oriented experiment reports ?
Target question
“Which reference genome was used when annotating the
genetic variants ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
Can SHARP produce human-oriented experiment reports ?
Target question
“Which reference genome was used when annotating the
genetic variants ?”
Material & methods
• Galaxy PROV trace
• Taverna PROV trace
• Manually provided
owl:sameAs
• Harmonization process
• Summarization query
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
Can SHARP produce human-oriented experiment reports ?
Harmonization results
Galaxy PROV Taverna PROV Harmonized PROV
predicates counts predicates counts predicates counts
prov:wasDer.From 118 rdf:type 54 owl:differentFrom 3617
rdf:type 76 rdfs:label 13 rdf:type 958
rdfs:label 62 prov:atTime 8 prov:wasInflu.By 515
prov:used 61 wfprov:descByParam. 6 prov:influenced 291
prov:wasAttr.To 34 rdfs:comment 6 rdfs:seeAlso 268
prov:wasGen.By 33 prov:hadRole 6 rdfs:subClassOf 223
prov:endedAtTime 26 prov:activity 5 owl:disjointWith 218
prov:startedAtTime 26 purl:hasPart 4 rdfs:range 208
prov:wasAsso.With 26 prov:agent 4 rdfs:domain 199
prov:gen.AtTime 1 prov:endedAtTime 4 prov:wasGen.By 172
all 463 all 177 all 8654
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 18
Can SHARP produce human-oriented experiment reports ?
Nanopublication as summarization result
“Which reference genome was used when predicting the phenotypes ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 19
Conclusion & future works
Conclusion & future works
Achievements
• Reasoning to address PROV capture heterogeneity
• Reconciliation of multi-site workflow provenance
• Human-oriented nanopublications: domain-specific
vocabulary, tractable size
• Open source implementation for all PROV inference rules
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20
Conclusion & future works
Achievements
• Reasoning to address PROV capture heterogeneity
• Reconciliation of multi-site workflow provenance
• Human-oriented nanopublications: domain-specific
vocabulary, tractable size
• Open source implementation for all PROV inference rules
Perspectives
• size of inferred graphs: OWL/PROV entailment subsets ?
• not yet automated summarization query → work in progress
• towards decentralized provenance harmonization
(federated querying)
• “user study” to assess better interpretation, sharing, trust
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20
Questions ?
alban.gaignard@univ-nantes.fr
Acknowledgments

Contenu connexe

Similaire à SHARP: Harmonizing cross-workflow Provenance (6)

THoSP: an Algorithm for Nesting Property Graphs
THoSP: an Algorithm for Nesting Property GraphsTHoSP: an Algorithm for Nesting Property Graphs
THoSP: an Algorithm for Nesting Property Graphs
 
SSB 2018 Comparative Phylogeography Workshop
SSB 2018 Comparative Phylogeography WorkshopSSB 2018 Comparative Phylogeography Workshop
SSB 2018 Comparative Phylogeography Workshop
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocado
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Peter krusche population based targeted validation of structural variant brea...
Peter krusche population based targeted validation of structural variant brea...Peter krusche population based targeted validation of structural variant brea...
Peter krusche population based targeted validation of structural variant brea...
 
Soft Shake Event / A soft introduction to Neo4J
Soft Shake Event / A soft introduction to Neo4JSoft Shake Event / A soft introduction to Neo4J
Soft Shake Event / A soft introduction to Neo4J
 

Plus de Syed Muhammad Ali Hasnain

Fair data vs 5 star open data final
Fair data vs 5 star open data finalFair data vs 5 star open data final
Fair data vs 5 star open data final
Syed Muhammad Ali Hasnain
 

Plus de Syed Muhammad Ali Hasnain (10)

Fair data vs 5 star open data final
Fair data vs 5 star open data finalFair data vs 5 star open data final
Fair data vs 5 star open data final
 
Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...
 
Exploiting Cognitive Computing and Frame Semantic Features for Biomedical Doc...
Exploiting Cognitive Computing and Frame Semantic Features for Biomedical Doc...Exploiting Cognitive Computing and Frame Semantic Features for Biomedical Doc...
Exploiting Cognitive Computing and Frame Semantic Features for Biomedical Doc...
 
An Approach for Discovering and Exploring Semantic Relationships between Genes
An Approach for Discovering and Exploring Semantic Relationships between GenesAn Approach for Discovering and Exploring Semantic Relationships between Genes
An Approach for Discovering and Exploring Semantic Relationships between Genes
 
Federated Query Formulation and Processing through BioFed
Federated Query Formulation and Processing through BioFedFederated Query Formulation and Processing through BioFed
Federated Query Formulation and Processing through BioFed
 
Processing Life Science Data at Scale - using Semantic Web Technologies
Processing Life Science Data at Scale - using Semantic Web TechnologiesProcessing Life Science Data at Scale - using Semantic Web Technologies
Processing Life Science Data at Scale - using Semantic Web Technologies
 
A Provenance assisted Roadmap for Life Sciences Linked Open Data Cloud
A Provenance assisted Roadmap for Life Sciences Linked Open Data CloudA Provenance assisted Roadmap for Life Sciences Linked Open Data Cloud
A Provenance assisted Roadmap for Life Sciences Linked Open Data Cloud
 
Improving discovery in Life Sciences Linked Open Data Cloud
Improving discovery in Life Sciences Linked Open Data CloudImproving discovery in Life Sciences Linked Open Data Cloud
Improving discovery in Life Sciences Linked Open Data Cloud
 
Knowledge Processing with Big Data and Semantic Web Technologies
Knowledge Processing with Big Data and  Semantic Web TechnologiesKnowledge Processing with Big Data and  Semantic Web Technologies
Knowledge Processing with Big Data and Semantic Web Technologies
 
FedViz: A Visual Interface for SPARQL Queries Formulation and Execution
FedViz: A Visual Interface for SPARQL Queries Formulation and ExecutionFedViz: A Visual Interface for SPARQL Queries Formulation and Execution
FedViz: A Visual Interface for SPARQL Queries Formulation and Execution
 

Dernier

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 

Dernier (20)

Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 

SHARP: Harmonizing cross-workflow Provenance

  • 1. SHARP: Harmonizing cross-workflow provenance SeWeBMeDA’17 Alban Gaignard1 , Khalid Belhajjame2 , Hala Skaf-Molli3 May 28, 2017 1 Nantes Academic Hospital, France 2 LAMSADE Paris-Dauphine University, France 3 LS2N - Nantes University, France
  • 2. Multi-scale biological observations ? data silos ! ! ! ! A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
  • 3. Multi-scale biological observations ? data silos ! ! ! ! Context • poor interoperability • data diversity + volume • multi-site collaborations, decentralized data A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
  • 4. Multi-scale biological observations ? data silos ! ! ! ! Context • poor interoperability • data diversity + volume • multi-site collaborations, decentralized data Workflows Modeling complex analysis, running at large scale A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
  • 5. Multi-scale biological observations ? data silos ! ! ! ! Context • poor interoperability • data diversity + volume • multi-site collaborations, decentralized data Workflows Modeling complex analysis, running at large scale Provenance Informing on data sources, production and analysis chains → transparency, reproducibility A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
  • 6. Multiple sites/scientists → multiple workflow engines ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37 go to owl:sameAs A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 2
  • 7. Multiple sites/scientists → multiple workflow engines ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37 “Which alignment algorithm was used when predicting these effects ?” A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
  • 8. Multiple sites/scientists → multiple workflow engines ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37 “Which alignment algorithm was used when predicting these effects ?” “A new version of a reference genome is available, which genome was used when predicting these phenotypes ?” A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
  • 9. Multiple sites/scientists → multiple workflow engines ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37 “Which alignment algorithm was used when predicting these effects ?” “A new version of a reference genome is available, which genome was used when predicting these phenotypes ?” Need for an overall tracking of provenance over both Galaxy and Taverna workflows ! A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
  • 10. PROV-O ontology https://www.w3.org/TR/prov-o A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 4
  • 11. Heterogeneity in provenance capture ! Galaxy PROV predicates counts prov:wasDerivedFrom 118 rdf:type 76 rdfs:label 62 prov:used 61 prov:wasAttributedTo 34 prov:wasGeneratedBy 33 prov:endedAtTime 26 prov:startedAtTime 26 prov:wasAssociatedWith 26 prov:generatedAtTime 1 https://www.w3.org/TR/prov-o Taverna PROV predicates counts rdf:type 54 rdfs:label 13 prov:atTime 8 wfprov:describedByParameter 6 rdfs:comment 6 prov:hadRole 6 prov:activity 5 dcterms:hasPart 4 prov:agent 4 prov:endedAtTime 4 prov:hadPlan 4 prov:qualifiedAssociation 4 prov:qualifiedEnd 4 prov:qualifiedStart 4 prov:startedAtTime 4 prov:wasAssociatedWith 4 tavernaprov:content 3 wfprov:usedInput 3 wfprov:wasEnactedBy 3 wfprov:wasOutputFrom 3 A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 5
  • 12. Problem statement • Heterogeneity in PROV predicates: how to reconcile different PROV traces ? • Needs for human-oriented interpretation: tractable size, domain concepts. A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 6
  • 13. Approach 1. Applying graph saturation and PROV inferences to overcome vocabulary heterogeneity. 2. Summarizing harmonized provenance graphs as life-science experiment reports. A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 7
  • 14. SHARP: Harmonizing multiple PROV graphs owl:sameAs inferred PROV PROV trace PROV trace nanopub PROV interlinking PROV harmonization PROV summarization 11 12 13 … 14 A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 8
  • 16. – Multi-provenance linking: owl:sameAs Idea Stating that two PROV entities associated to the same file content are the same. go to example A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
  • 17. – Multi-provenance linking: owl:sameAs Idea Stating that two PROV entities associated to the same file content are the same. go to example Producing owl:sameAs 1. SHA-512 fingerprint of files 2. annotating PROV entities with the SHA-512 digest 3. producing owl:sameAs → SPARQL CONSTRUCT-WHERE query A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
  • 18. — Reasoning: PROV inference regime A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 10
  • 19. — Tuple-Generating Dependencies (TGDs) Idea Repeatedly inferring new facts based on existing ones until saturation (production rules / forward chaining). A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
  • 20. — Tuple-Generating Dependencies (TGDs) Idea Repeatedly inferring new facts based on existing ones until saturation (production rules / forward chaining). Example wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a). A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
  • 21. — Tuple-Generating Dependencies (TGDs) Idea Repeatedly inferring new facts based on existing ones until saturation (production rules / forward chaining). Example wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a). Translation in SPARQL CONSTRUCT { ?e_2 prov:wasGeneratedBy _:blank_node . _:blank_node prov:used ?e_1 } WHERE { ?e_2 prov:wasDerivedFrom ?e_1 } A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
  • 22. — Tuple-Generating Dependencies (TGDs) Idea Repeatedly inferring new facts based on existing ones until saturation (production rules / forward chaining). Example wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a). Translation in SPARQL CONSTRUCT { ?e_2 prov:wasGeneratedBy _:blank_node . _:blank_node prov:used ?e_1 } WHERE { ?e_2 prov:wasDerivedFrom ?e_1 } Issues • may produce many non-informative blank nodes for existential variables. • may increase the size of the PROV graph. A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
  • 23. ˜ Equality-Generating Dependencies (EGDs) Idea • Two blank nodes have the same neighborhood → keep only one BN • One blank node and a resource have the same neighborhood → keep only the resource A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
  • 24. ˜ Equality-Generating Dependencies (EGDs) Idea • Two blank nodes have the same neighborhood → keep only one BN • One blank node and a resource have the same neighborhood → keep only the resource Example prov:qualifiedGeneration e a prov:activity :_gen1 :_gen2 prov:qualifiedGeneration prov:activity prov:used a2 prov:qualifiedGeneration e a prov:activity :_gen1 prov:used a2 A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
  • 25. ˜ Equality-Generating Dependencies (EGDs) Input : G : the provenance graph resulting from the application of TGD on G Output: G : the provenance graph with substituted blank nodes, when possible. 1 begin 2 G ← G 3 substitutions ← new List < Pair < Node, Node >> () 4 repeat 5 S ← findSubstitutions(G ) 6 foreach (s ∈ S) do 7 source ← s[0] 8 target ← s[1] 9 foreach (in ∈ G .listStatements(∗, ∗, source)) do 10 G ← G .add(in.getSubject(), in.getPredicate(), target) 11 G ← G .del(in) 12 foreach (out ∈ G .listStatements(source, ∗, ∗)) do 13 G ← G .add(target, out.getPredicate(), out.getObject()) 14 G ← G .del(out) 15 until (S.size() = 0) A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 13
  • 27. ™ From harmonized PROV to Nanopublications Objectives Linking data to scientific context and publishing them as cite-able and exchangeable experiment reports. → Semantic science Integrated Ontology + NanoPublications A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
  • 28. ™ From harmonized PROV to Nanopublications Objectives Linking data to scientific context and publishing them as cite-able and exchangeable experiment reports. → Semantic science Integrated Ontology + NanoPublications reference genome sample_001 predicted phenotypes from exons sio:has-phenotype sio:is-variant-of sio:is-supported-by sio:is-supported-by sio:is-supported-by sio:is-supported-by scientific question :head { ex:pub1 a np:Nanopublication . ex:pub1 np:hasAssertion :assertion1 ; np:hasAssertion :assertion2 . ex:pub1 np:hasProvenance :provenance . ex:pub1 np:hasPublicationInfo :pubInfo . } :assertion1 { ex:question a sio:Question ; sio:has-value "What are the effects of SNPs located in exons for study-Y samples" ; sio:is-supported-by ex:referenceGenome ; sio:is-supported-by ex:sample_001 ; sio:is-supported-by ex:annotatedVariants . } :assertion2 { ex:referenceGenome a sio:Genome . ex:sample_001 a sio:Sample ; sio:is-variant-of ex:referenceGenome ; sio:has-phenotype ex:annotatedVariants . } A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
  • 29. ™ From harmonized PROV to Nanopublications How ? By identifying a data lineage path in multiple PROV graphs, beforehand harmonized (inferred prov:wasInfluencedBy). A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
  • 30. ™ From harmonized PROV to Nanopublications How ? By identifying a data lineage path in multiple PROV graphs, beforehand harmonized (inferred prov:wasInfluencedBy). CONSTRUCT { GRAPH :assertion { ?ref_genome a sio:Genome . ?sample a sio:Sample ; sio:is-variant-of ?ref_genome ; sio:has-phenotype ?out . ?out rdfs:label ?out_label . ?out sio:is-supported-by ?ref_genome . } } WHERE { ?sample rdfs:label ?sample_label. FILTER (contains(lcase(str(?sample_label)), lcase("fastq"))) . ?ref_genome rdfs:label ?ref_genome_label. FILTER (contains(lcase(str(?ref_genome_label)), lcase("GRCh"))) . ?out ( prov:wasInfluencedBy )+ ?sample ?out tavernaprov:content ?out_label . FILTER (contains(lcase(str(?out_label)), lcase("exons"))) . } A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
  • 32. Time and space for multiple PROV harmonization ? Material & methods • 3 ProvStore graphs • 10, 109 and 666 edges • Java + Jena rule engine • Laptop, 4 cores, 16GB, 5 runs A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
  • 33. Time and space for multiple PROV harmonization ? Material & methods • 3 ProvStore graphs • 10, 109 and 666 edges • Java + Jena rule engine • Laptop, 4 cores, 16GB, 5 runs Results • almost 5s for the whole harmonization process • no new wasDerivedFrom relation (must be captured) • inferred wasInfluencedBy → “common denominator” for data lineage ∆ size ∆ BNodes ∆ wDerivedFrom ∆ wInfluencedBy mean time PA + 2776 + 2 + 1 + 7 4835 ± 343 ms PB + 3102 + 4 + 1 + 58 4759 ± 71 ms PC + 5023 + 63 + 1 + 231 5304 ± 176 ms A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
  • 34. Can SHARP produce human-oriented experiment reports ? Target question “Which reference genome was used when annotating the genetic variants ?” A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
  • 35. Can SHARP produce human-oriented experiment reports ? Target question “Which reference genome was used when annotating the genetic variants ?” Material & methods • Galaxy PROV trace • Taverna PROV trace • Manually provided owl:sameAs • Harmonization process • Summarization query A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
  • 36. Can SHARP produce human-oriented experiment reports ? Harmonization results Galaxy PROV Taverna PROV Harmonized PROV predicates counts predicates counts predicates counts prov:wasDer.From 118 rdf:type 54 owl:differentFrom 3617 rdf:type 76 rdfs:label 13 rdf:type 958 rdfs:label 62 prov:atTime 8 prov:wasInflu.By 515 prov:used 61 wfprov:descByParam. 6 prov:influenced 291 prov:wasAttr.To 34 rdfs:comment 6 rdfs:seeAlso 268 prov:wasGen.By 33 prov:hadRole 6 rdfs:subClassOf 223 prov:endedAtTime 26 prov:activity 5 owl:disjointWith 218 prov:startedAtTime 26 purl:hasPart 4 rdfs:range 208 prov:wasAsso.With 26 prov:agent 4 rdfs:domain 199 prov:gen.AtTime 1 prov:endedAtTime 4 prov:wasGen.By 172 all 463 all 177 all 8654 A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 18
  • 37. Can SHARP produce human-oriented experiment reports ? Nanopublication as summarization result “Which reference genome was used when predicting the phenotypes ?” A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 19
  • 39. Conclusion & future works Achievements • Reasoning to address PROV capture heterogeneity • Reconciliation of multi-site workflow provenance • Human-oriented nanopublications: domain-specific vocabulary, tractable size • Open source implementation for all PROV inference rules A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20
  • 40. Conclusion & future works Achievements • Reasoning to address PROV capture heterogeneity • Reconciliation of multi-site workflow provenance • Human-oriented nanopublications: domain-specific vocabulary, tractable size • Open source implementation for all PROV inference rules Perspectives • size of inferred graphs: OWL/PROV entailment subsets ? • not yet automated summarization query → work in progress • towards decentralized provenance harmonization (federated querying) • “user study” to assess better interpretation, sharing, trust A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20