Paolo Missier
School of Computing
Newcastle University, UK
1st CMLS workshop @ER 2020
November 4, 2020
ReComp: optimising the re-execution of analytics pipelines in response to changes in the data
Topics
Data evolution and its impact on data-driven processes
The ProvONE provenance model and role in selective process re-run
ReComp outlook: from source queries to analytics outcomes
CMLSworkshop2020–P.Missier
Context
[Figure: Big Data → The Big Analytics Machine → Actionable Knowledge. Analytics / data science evolve over time t (V1, V2, V3), as does the meta-knowledge they depend on: algorithms, tools, libraries, reference datasets.]
Reacting to changes in inputs
1. Always refresh
2. Approximate
[Figure: inputs x1, x2; reference data d1, d2; process f(); output y1]
A refresh-if-needed approach
Case study: genomics
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
https://www.genomicsengland.co.uk/the-100000-genomes-project/
Spark GATK tools on Azure: 45 mins / GB; at 13 GB / exome, about 10 hours
Variant calling / interpretation
Genomics: WES / WGS, Variant calling → Variant interpretation
SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International Conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer
SVI: Simple Variant Interpretation
Variant classification: pathogenic, benign and unknown/uncertain
Reacting to changes in inputs
Refresh heuristics:
1. Changes in patient’s phenotype → Refresh
2. Changes in patient’s variants:
- New variants appear → run SVI on those new variants only
- Variants removed (rare) → remove those from previous SVI output
The function computed by the workflow is not stable!
Small input perturbation → possibly large impact
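The refresh heuristics on this slide can be sketched as a small incremental update. This is only an illustration: `refresh`, `run_svi` and `toy_svi` are hypothetical names, and the sketch assumes each variant can be classified independently of the others, which is exactly the stability assumption the slide warns may not hold.

```python
from typing import Callable, Dict, Set

# Hypothetical type: maps a set of variant ids to one label per variant.
Classifier = Callable[[Set[str]], Dict[str, str]]


def refresh(prev_output: Dict[str, str],
            old_variants: Set[str],
            new_variants: Set[str],
            phenotype_changed: bool,
            run_svi: Classifier) -> Dict[str, str]:
    """Apply the refresh heuristics to one patient's previous output."""
    if phenotype_changed:
        # Heuristic 1: a phenotype change triggers a full refresh.
        return run_svi(new_variants)
    added = new_variants - old_variants
    removed = old_variants - new_variants
    # Heuristic 2, removal: drop results for variants that disappeared.
    output = {v: label for v, label in prev_output.items() if v not in removed}
    # Heuristic 2, addition: classify only the newly appeared variants.
    if added:
        output.update(run_svi(added))
    return output


def toy_svi(variants: Set[str]) -> Dict[str, str]:
    """Stand-in classifier (assumption): labels every variant 'unknown'."""
    return {v: "unknown" for v in variants}


previous = {"v1": "benign", "v2": "pathogenic"}
result = refresh(previous, {"v1", "v2"}, {"v2", "v3"}, False, toy_svi)
full = refresh(previous, {"v1", "v2"}, {"v2", "v3"}, True, toy_svi)
```

In the incremental case only `v3` is passed to the classifier; in the phenotype-change case every current variant is.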
Changes in data dependencies
<top-k annotated variants> = f(varset, phenotype, ClinVar, GeneMap)
Evolution in number of variants that affect patients: (a) with a specific phenotype, (b) across all phenotypes
ClinVar: monthly updates
Changes in variants: ViruSurf
Frequency of single mutations in sequences collected from the population, taken at different time points
                   ≤ 31/03/2020                  ≥ 01/04/2020
                   ViruSurf   ViruSurf-GISAID    ViruSurf   ViruSurf-GISAID
With D614G         6,592      15,034             23,649     18,421
Without D614G      4,664      8,821              3,331      3,369
D614G %            58.56%     63.02%             87.65%     84.54%
Total @ AUGUST     61.59%                        86.26%
Total @ OCTOBER    42.38%                        90.00%
Frequency of variants with high transmissibility
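As a check on the table, the percentages can be recomputed from the counts. The sketch below uses only the numbers shown; the dictionary layout and function names are illustrative. Pooling the two sources per period reproduces the "Total @ AUGUST" row, which is an observation from the arithmetic rather than something the slide states.

```python
# Counts transcribed from the table: (with D614G, without D614G).
counts = {
    ("<= 31/03/2020", "ViruSurf"): (6592, 4664),
    ("<= 31/03/2020", "ViruSurf-GISAID"): (15034, 8821),
    (">= 01/04/2020", "ViruSurf"): (23649, 3331),
    (">= 01/04/2020", "ViruSurf-GISAID"): (18421, 3369),
}


def share(with_mutation: int, without_mutation: int) -> float:
    """Percentage of sequences carrying the mutation."""
    return 100.0 * with_mutation / (with_mutation + without_mutation)


# Per-source percentage, rounded as in the table.
per_source = {key: round(share(*pair), 2) for key, pair in counts.items()}


def pooled(period: str) -> float:
    """Percentage over both sources combined for one period."""
    with_total = sum(w for (p, _), (w, _wo) in counts.items() if p == period)
    without_total = sum(wo for (p, _), (_w, wo) in counts.items() if p == period)
    return round(share(with_total, without_total), 2)


early_total = pooled("<= 31/03/2020")
late_total = pooled(">= 01/04/2020")
```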
Numbers change because:
i) new knowledge arises from literature,
ii) variants are re-computed by eliminating low-quality sequences/sub-sequences,
iii) more sequences are submitted from laboratories…
Canakoglu A., Pinoli P., Bernasconi A., Alfonsi T., Melidis D.P., Ceri S. ViruSurf: an integrated database to investigate viral sequences. Nucleic Acids Research, 2020. https://doi.org/10.1093/nar/gkaa846
Bernasconi A., Canakoglu A., Pinoli P., Ceri S. Empowering Virus Sequence Research through Conceptual Modeling. In Proceedings of the 39th International Conference on Conceptual Modeling (ER 2020).
Diff functions for SVI: ClinVar
[Figure: ClinVar 1/2016 vs ClinVar 1/2017: diff and (unchanged) records]
Relational data → simple set difference
The ClinVar dataset: 30 columns
Changes:
Records: 349,074 → 543,841
Added 200,746. Removed 5,979. Updated 27,662.
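A set-difference diff over keyed relational records might look like the sketch below. The record keys (`rs1`, …) and the single `status` column are made up for illustration; the real ClinVar table has 30 columns and the slide does not say which of them form the key.

```python
def diff_records(old: dict, new: dict):
    """Split two versions of a keyed table into added, removed and updated rows."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, updated


# Toy versions of a reference table (hypothetical keys and values).
old_version = {
    "rs1": {"status": "unknown"},
    "rs2": {"status": "benign"},
    "rs3": {"status": "unknown"},
}
new_version = {
    "rs1": {"status": "pathogenic"},
    "rs2": {"status": "benign"},
    "rs4": {"status": "unknown"},
}

added, removed, updated = diff_records(old_version, new_version)

# Record counts must balance: old + added - removed == new.
balanced = len(old_version) + len(added) - len(removed) == len(new_version)
```

The counts on the slide satisfy the same balance: 349,074 + 200,746 − 5,979 = 543,841.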
Reacting to changes in data dependencies
Impact: “Any variant with status moving from/to Red causes High impact on any patient who is affected by the variant”
Observation: variants v within output set y that are in scope for patient X remain in scope (monotonicity)
1. Variant v changes status
- unknown → benign
- unknown → deleterious
2. Brand new variant
→ If in scope: compare status before / after (inexpensive)
→ Recompute SVI on all inputs (expensive)
Scope: which cases are affected? “A change in variant v can only have impact on a case X if v and X share the same phenotype”
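The scope and impact rules quoted above can be written as two cheap predicates that run before any re-execution. This is a minimal sketch: the dictionary fields and the literal status label `"red"` are assumptions made for illustration, not the SVI data model.

```python
def in_scope(change: dict, patient: dict) -> bool:
    """Scope rule: the changed variant and the case share a phenotype."""
    return patient["phenotype"] in change["phenotypes"]


def is_high_impact(change: dict, patient: dict) -> bool:
    """Impact rule: status moves from/to red and the patient has the variant."""
    if not in_scope(change, patient):
        return False
    if change["variant"] not in patient["variants"]:
        return False
    was_red = change["old_status"] == "red"
    is_red = change["new_status"] == "red"
    return was_red != is_red


# Hypothetical change record and patients.
change = {
    "variant": "v42",
    "phenotypes": {"phenotype-A"},
    "old_status": "amber",
    "new_status": "red",
}
affected = {"phenotype": "phenotype-A", "variants": {"v42", "v7"}}
other_phenotype = {"phenotype": "phenotype-B", "variants": {"v42"}}
not_carrier = {"phenotype": "phenotype-A", "variants": {"v7"}}

decisions = [is_high_impact(change, p)
             for p in (affected, other_phenotype, not_carrier)]
```

Only the first patient would be flagged; the other two are filtered out without running SVI.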
SVI is unstable
Sparsity issue:
• About 500 executions
• 33 patients
• total runtime about 60 hours
• Only 14 relevant output changes detected
4.2 hours of computation per change
Should we care about updates?
Evolving knowledge about gene variations
Empirical evaluation
Re-executions: 495 → 71 (ideal: 14)
But: no false negatives
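The figures above imply the following savings. The variable names are illustrative, and reading 495 as "re-run everything" and 71 as "re-run after scoping" is an interpretation of the slide rather than something it states.

```python
blind_reexecutions = 495    # before selective re-computation
scoped_reexecutions = 71    # after selective re-computation
ideal_reexecutions = 14     # executions whose output actually changed

# Executions avoided, and the fraction of the original total they represent.
avoided = blind_reexecutions - scoped_reexecutions
avoided_fraction = avoided / blind_reexecutions

# Executions still performed without producing a changed output.
still_unnecessary = scoped_reexecutions - ideal_reexecutions

# Share of the remaining executions that were actually needed.
useful_fraction = ideal_reexecutions / scoped_reexecutions

print(f"avoided {avoided} executions ({avoided_fraction:.1%})")
print(f"{still_unnecessary} executions remain unnecessary")
print(f"{useful_fraction:.1%} of remaining executions were needed")
```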
General Problem formulation
d: current state of data source D
Changes in d:
Problem: for each x in X
estimate impact without computing f(x,d)
(assuming f is not stable)
Scope of impact:
Impact due to changes in d:
Impact estimation:
Diff and impact functions for white box process
- Data-specific
- Process-specific
[Figure labels: omim; clinvar; impact on ‘p1 select genes’; impact on the SVI output; overall impact]
White box ReComp and Process Provenance
White box process structure → detailed history of each execution → more granular optimisations become possible
[Figure labels: change in ClinVar; change in GeneMap]
PROV (2013)
https://www.w3.org/TR/prov-dm/
Prov + process structure = ProvONE
[Figure: ProvONE UML class diagram. Classes: User, Execution, Controller, Program, Workflow, Channel, Port, «Entity», «Collection», «agent», and the qualified classes «Association», «Usage», «Generation». Associations: wasPartOf, hasSubProgram, controlledBy, controls, hasInPort, hasOutPort, hasDefaultParam, connectsTo, hadInPort, hadOutPort, hadEntity, together with the PROV relations «wasAssociatedWith», «used», «wasGeneratedBy», «wasDerivedFrom», «wasInformedBy», «hadMember», «hadPlan», «qualifiedAssociation», «qualifiedUsage», «qualifiedGeneration».]
Workflow Provenance: example
ProvONE combines process structure with runtime provenance
[Figure: the “plan” (Workflow WF with Programs B1, B2, B3 and ports I1, I2, O1, O2) alongside the “plan execution” (Executions WFexec, B1exec, B3exec and Entity D, a data source / reference data item), linked by hasSubProgram, wasPartOf, wasAssociatedWith, used (usage) and wasGeneratedBy]
wasAssociatedWith(wfexec, wf)
wasAssociatedWith(B1exec, B1)
wasAssociatedWith(B3exec, B3)
wasPartOf(B1exec, wfexec)
wasPartOf(B3exec, wfexec)
wasGeneratedBy(g,D,B1exec)
used(u,B3exec,D)
hadInPort(u, I2)
hadOutPort(g, O2)
wf isa Workflow
hasSubProgram(wf, B1)
hasSubProgram(wf, B2)
hasSubProgram(wf, B3)
hasOutPort(B3,O1)
hasInPort(B3,I2)
connectsTo(O2, c)
connectsTo(I2, c)
…
B1exec is an instance of program B1, B3exec is an instance of program B3, which are part of Workflow wf
B3 has input port I2; B1 has output port O2, and these are connected (through channel c)
Data item D was generated by B1exec on port O2 and used by B3exec on port I2
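The statements above can be queried mechanically. The sketch below stores the facts listed on the slide as plain tuples and answers the questions in the three sentences: who produced D, who consumed it, and whether the ports are connected. The tuple encoding and the helper names are assumptions of this sketch, not part of ProvONE.

```python
# Facts transcribed from the slide (a subset); the encoding is illustrative.
FACTS = [
    ("wasAssociatedWith", "wfexec", "wf"),
    ("wasAssociatedWith", "B1exec", "B1"),
    ("wasAssociatedWith", "B3exec", "B3"),
    ("wasPartOf", "B1exec", "wfexec"),
    ("wasPartOf", "B3exec", "wfexec"),
    ("wasGeneratedBy", "g", "D", "B1exec"),
    ("used", "u", "B3exec", "D"),
    ("hadInPort", "u", "I2"),
    ("hadOutPort", "g", "O2"),
    ("hasInPort", "B3", "I2"),
    ("connectsTo", "O2", "c"),
    ("connectsTo", "I2", "c"),
]


def args(relation):
    """Argument tuples of every fact with the given relation name."""
    return [fact[1:] for fact in FACTS if fact[0] == relation]


def program_of(execution):
    """Program that an execution is an instance of."""
    return next(p for e, p in args("wasAssociatedWith") if e == execution)


def producer_of(entity):
    """(execution, program, output port) that generated the entity, if any."""
    for generation, generated, execution in args("wasGeneratedBy"):
        if generated == entity:
            port = next(p for g, p in args("hadOutPort") if g == generation)
            return execution, program_of(execution), port
    return None


def consumers_of(entity):
    """(execution, program, input port) triples that used the entity."""
    found = []
    for usage, execution, used_entity in args("used"):
        if used_entity == entity:
            port = next(p for u, p in args("hadInPort") if u == usage)
            found.append((execution, program_of(execution), port))
    return found


def share_channel(out_port, in_port):
    """True if both ports connect to at least one common channel."""
    def channels(port):
        return {c for p, c in args("connectsTo") if p == port}
    return bool(channels(out_port) & channels(in_port))
```

For example, `producer_of("D")` and `consumers_of("D")` recover the B1exec-to-B3exec data flow described in the last sentence above.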
ProvONE enables selective re-execution
ProvONE traces the execution of nested workflows
Enables selective re-computation of the fragments affected by a change
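One way to picture selective re-computation is a downstream closure over the workflow graph: start from the blocks that read the changed data source and follow the connections forward. The block names, edges and data dependencies below are hypothetical; in practice they would be derived from the recorded provenance rather than hard-coded.

```python
from collections import deque


def affected_blocks(edges: dict, reads: dict, changed_source: str) -> set:
    """Blocks that read the changed source, plus everything downstream."""
    start = [block for block, sources in reads.items()
             if changed_source in sources]
    seen = set(start)
    queue = deque(start)
    while queue:
        block = queue.popleft()
        for successor in edges.get(block, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


# Hypothetical three-step pipeline: p1 -> p2 -> p3.
edges = {"p1": ["p2"], "p2": ["p3"], "p3": []}
# Hypothetical reference-data dependencies of each block.
reads = {"p1": {"GeneMap"}, "p3": {"ClinVar"}}

after_clinvar_change = affected_blocks(edges, reads, "ClinVar")
after_genemap_change = affected_blocks(edges, reads, "GeneMap")
```

In this toy pipeline a ClinVar change touches only the last block, while a GeneMap change propagates through all three.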
Summary: the ReComp meta-process
[Figure: the ReComp meta-process loop]
- Change events: changes in reference datasets and in inputs
- Detect and quantify changes: data diff(d,d’)
- Record execution history: the log / provenance of analytics process P goes into the History DB
- For each past instance, estimate the impact of changes: Impact(d→d’, o), using impact estimation functions
- Scope: select relevant sub-processes
- Optimisation: partially re-execute, P(D) → P(D’)
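The meta-process can be sketched as a loop that takes the diff, scope, impact and re-run steps as pluggable functions. Everything here is illustrative: `PastRun`, `recomp`, the toy plug-ins and the threshold are assumptions, and the toy impact estimate (the fraction of a run's reference keys that changed) stands in for the data- and process-specific functions the talk describes.

```python
from dataclasses import dataclass


@dataclass
class PastRun:
    case_id: str
    keys_used: frozenset  # reference-data keys this run depended on (toy model)
    output: int


def recomp(history, old_ref, new_ref, diff, in_scope,
           estimate_impact, rerun, threshold):
    """Re-run only the past executions whose estimated impact is high enough."""
    delta = diff(old_ref, new_ref)
    refreshed = []
    for run in history:
        if not delta or not in_scope(run, delta):
            continue  # the change cannot affect this run
        if estimate_impact(run, delta) >= threshold:
            run.output = rerun(run, new_ref)
            refreshed.append(run.case_id)
    return refreshed


# Toy plug-ins (assumptions, not the ReComp implementation).
def diff(old, new):
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


def in_scope(run, delta):
    return bool(run.keys_used & delta)


def estimate_impact(run, delta):
    return len(run.keys_used & delta) / len(run.keys_used)


def rerun(run, ref):
    return sum(ref.get(k, 0) for k in run.keys_used)


old_ref = {"a": 1, "b": 2, "c": 3}
new_ref = {"a": 1, "b": 5, "c": 3}
history = [
    PastRun("x1", frozenset({"a"}), 1),
    PastRun("x2", frozenset({"a", "b"}), 3),
    PastRun("x3", frozenset({"a", "b", "c"}), 6),
]

refreshed = recomp(history, old_ref, new_ref, diff, in_scope,
                   estimate_impact, rerun, threshold=0.5)
```

With these toy values only `x2` is re-run: `x1` is out of scope, and `x3` falls below the threshold and keeps a stale output. That is the kind of trade-off a real impact function would have to manage, given that the talk's evaluation reports no false negatives.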
End-to-end ReComp
Observe / Control
ReComp
- When should I rerun a query?
- Which data changes should I react to?
Integration
Queries: the GenoMetric Query Language (*)
(*) Masseroli M, Canakoglu A, Pinoli P, Kaitoua A, Gulino A, Horlova O, Nanni L, Bernasconi A, Perna S, Stamoulakatou E, Ceri S. Processing
of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data. Bioinformatics, 2018.
https://doi.org/10.1093/bioinformatics/bty688
The GenoMetric Query Language
A query processing environment for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets
ReComp for GenoMetric Query Language 1
For kidney cancers, find mutations and their number in each exon
_2017_11
Different somatic mutation datasets at different time points
Different annotation files are released periodically
Recomputation of the query may only be required if we can estimate that more than 10% of samples would obtain a changed count
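The 10% rule above can be expressed as a simple threshold test. This sketch assumes an estimate of the new per-sample counts is already available; how that estimate is produced is not covered here, and the function and sample names are hypothetical.

```python
def should_recompute(old_counts: dict, estimated_counts: dict,
                     threshold: float = 0.10) -> bool:
    """True if more than `threshold` of the samples would get a changed count."""
    samples = old_counts.keys() | estimated_counts.keys()
    if not samples:
        return False
    changed = sum(
        1 for s in samples if old_counts.get(s) != estimated_counts.get(s)
    )
    return changed / len(samples) > threshold


# Toy data: 20 samples, each with 10 mutations per exon (made-up values).
old_counts = {f"s{i}": 10 for i in range(20)}

two_changed = dict(old_counts, s0=11, s1=12)          # 2 of 20 samples differ
three_changed = dict(old_counts, s0=11, s1=12, s2=9)  # 3 of 20 samples differ

decision_two = should_recompute(old_counts, two_changed)
decision_three = should_recompute(old_counts, three_changed)
```

Two changed samples out of twenty is exactly 10%, so the strict "more than" comparison does not trigger a re-run; three changed samples do.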
ReComp for GenoMetric Query Language 3
Find distal bindings in transcription regulatory regions: find all enriched regions (peaks) of the CTCF transcription factor (TF) in ENCODE ChIP-seq narrow peak samples from the GM12878 lymphoblastoid human cell line which are the nearest regions farther than 100 kb from a transcription start site (TSS). For the same cell line, find also all peaks of the H3K4me1 histone modification (HM) which are also the nearest regions farther than 100 kb from a TSS. Then, out of the TF and HM peaks found in the same cell line, return all TF peaks that overlap with at least one HM peak and a known enhancer (EN) region.
Different transcription factor datasets at different time points
Different annotation files are released periodically
Recomputation of the query may only be required if we can estimate a considerable change
Summary and selected references
ReComp is a customizable framework to enable optimisations of data-intensive processes
It reacts to changes in data by trying to predict their impact on process outcomes
When provenance is available, it can be used to optimise process re-runs
Prototype implementation exists
Extensions:
- controlling refresh of query results
- queries on data source establish which data changes we care about
- P. Missier and J. Cała, “Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes: Computational Framework, Reference Architecture, and Applications,” in Procs. IEEE Big Data Congress, 2019.
- J. Cała and P. Missier, “Selective and Recurring Re-computation of Big Data Analytics Tasks: Insights from a Genomics Case Study,” Big Data Res., vol. 13, pp. 76–94, Sep. 2018.
- J. Cała and P. Missier, “Provenance Annotation and Analysis to Support Process Re-Computation,” in Procs. IPAW 2018, 2018.
CMLSworkshop2020–P.Missier

Contenu connexe

Tendances

Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Paolo Missier
 
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictionsDeep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictionsValery Tkachenko
 
Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Valery Tkachenko
 
ReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case StudyReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case StudyPaolo Missier
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’ Paolo Missier
 
ReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthPaolo Missier
 
PR157: Best of both worlds: human-machine collaboration for object annotation
PR157: Best of both worlds: human-machine collaboration for object annotationPR157: Best of both worlds: human-machine collaboration for object annotation
PR157: Best of both worlds: human-machine collaboration for object annotationjaewon lee
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Anubhav Jain
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...BrianDeCost
 
ReComp project kickoff presentation 11-03-2016
ReComp project kickoff presentation 11-03-2016ReComp project kickoff presentation 11-03-2016
ReComp project kickoff presentation 11-03-2016Paolo Missier
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataAnubhav Jain
 
Materials Informatics Overview
Materials Informatics OverviewMaterials Informatics Overview
Materials Informatics OverviewTony Fast
 
ReComp: Preserving the value of large scale data analytics over time through...
ReComp:Preserving the value of large scale data analytics over time through...ReComp:Preserving the value of large scale data analytics over time through...
ReComp: Preserving the value of large scale data analytics over time through...Paolo Missier
 
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A CloudScalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud Paolo Missier
 
The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...Paolo Missier
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Anubhav Jain
 
The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...Anubhav Jain
 

Tendances (20)

Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
 
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictionsDeep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions
 
Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...
 
ReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case StudyReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case Study
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’
 
ReComp for genomics
ReComp for genomicsReComp for genomics
ReComp for genomics
 
ReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for Health
 
PR157: Best of both worlds: human-machine collaboration for object annotation
PR157: Best of both worlds: human-machine collaboration for object annotationPR157: Best of both worlds: human-machine collaboration for object annotation
PR157: Best of both worlds: human-machine collaboration for object annotation
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
 
ReComp project kickoff presentation 11-03-2016
ReComp project kickoff presentation 11-03-2016ReComp project kickoff presentation 11-03-2016
ReComp project kickoff presentation 11-03-2016
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
 
Materials Informatics Overview
Materials Informatics OverviewMaterials Informatics Overview
Materials Informatics Overview
 
ReComp: Preserving the value of large scale data analytics over time through...
ReComp:Preserving the value of large scale data analytics over time through...ReComp:Preserving the value of large scale data analytics over time through...
ReComp: Preserving the value of large scale data analytics over time through...
 
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A CloudScalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
 
The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...
 
The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...
 
Dynamic Symbolic Database Application Testing
Dynamic Symbolic Database Application TestingDynamic Symbolic Database Application Testing
Dynamic Symbolic Database Application Testing
 

Similaire à ReComp: optimising the re-execution of analytics pipelines in response to changes in the data

2014 Taverna Tutorial Introduction to eScience and workflows
2014 Taverna Tutorial Introduction to eScience and workflows2014 Taverna Tutorial Introduction to eScience and workflows
2014 Taverna Tutorial Introduction to eScience and workflowsmyGrid team
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridIan Foster
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...Denis C. Bauer
 
The beauty of workflows and models
The beauty of workflows and modelsThe beauty of workflows and models
The beauty of workflows and modelsmyGrid team
 
Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...Paolo Missier
 
Accelerating GWAS epistatic interaction analysis methods
Accelerating GWAS epistatic interaction analysis methodsAccelerating GWAS epistatic interaction analysis methods
Accelerating GWAS epistatic interaction analysis methodsPriscill Orue Esquivel
 
Jax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbuJax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbuAnne Deslattes Mays
 
Processing Amplicon Sequence Data for the Analysis of Microbial Communities
Processing Amplicon Sequence Data for the Analysis of Microbial CommunitiesProcessing Amplicon Sequence Data for the Analysis of Microbial Communities
Processing Amplicon Sequence Data for the Analysis of Microbial CommunitiesMartin Hartmann
 
ReComp: challenges in selective recomputation of (expensive) data analytics t...
ReComp: challenges in selective recomputation of (expensive) data analytics t...ReComp: challenges in selective recomputation of (expensive) data analytics t...
ReComp: challenges in selective recomputation of (expensive) data analytics t...Paolo Missier
 
La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...tuxette
 
COMBINE standards & tools: Getting model management right
COMBINE standards & tools: Getting model management rightCOMBINE standards & tools: Getting model management right
COMBINE standards & tools: Getting model management rightUniversity Medicine Greifswald
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009Ian Foster
 
Learning Biologically Relevant Features Using Convolutional Neural Networks f...
Learning Biologically Relevant Features Using Convolutional Neural Networks f...Learning Biologically Relevant Features Using Convolutional Neural Networks f...
Learning Biologically Relevant Features Using Convolutional Neural Networks f...Wesley De Neve
 
De los rasgos poligénicos a los poligenómicos 250517
De los rasgos poligénicos a los poligenómicos 250517De los rasgos poligénicos a los poligenómicos 250517
De los rasgos poligénicos a los poligenómicos 250517M. Gonzalo Claros
 
Prediction and Meta-Analysis
Prediction and Meta-AnalysisPrediction and Meta-Analysis
Prediction and Meta-AnalysisGolden Helix
 
Prediction and Meta-Analysis
Prediction and Meta-AnalysisPrediction and Meta-Analysis
Prediction and Meta-AnalysisGolden Helix Inc
 
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...Environmental Intelligence Lab
 
Software Pipelines: The Good, The Bad and The Ugly
Software Pipelines: The Good, The Bad and The UglySoftware Pipelines: The Good, The Bad and The Ugly
Software Pipelines: The Good, The Bad and The UglyJoão André Carriço
 

Similaire à ReComp: optimising the re-execution of analytics pipelines in response to changes in the data (20)

2014 Taverna Tutorial Introduction to eScience and workflows
2014 Taverna Tutorial Introduction to eScience and workflows2014 Taverna Tutorial Introduction to eScience and workflows
2014 Taverna Tutorial Introduction to eScience and workflows
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
 
VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...VariantSpark: applying Spark-based machine learning methods to genomic inform...
VariantSpark: applying Spark-based machine learning methods to genomic inform...
 
BioData World Basel 2018
BioData World Basel 2018BioData World Basel 2018
BioData World Basel 2018
 
The beauty of workflows and models
The beauty of workflows and modelsThe beauty of workflows and models
The beauty of workflows and models
 
Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...
 
Accelerating GWAS epistatic interaction analysis methods
Accelerating GWAS epistatic interaction analysis methodsAccelerating GWAS epistatic interaction analysis methods
Accelerating GWAS epistatic interaction analysis methods
 
Jax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbuJax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbu
 
Processing Amplicon Sequence Data for the Analysis of Microbial Communities
Processing Amplicon Sequence Data for the Analysis of Microbial CommunitiesProcessing Amplicon Sequence Data for the Analysis of Microbial Communities
Processing Amplicon Sequence Data for the Analysis of Microbial Communities
 
ReComp: challenges in selective recomputation of (expensive) data analytics t...
ReComp: challenges in selective recomputation of (expensive) data analytics t...ReComp: challenges in selective recomputation of (expensive) data analytics t...
ReComp: challenges in selective recomputation of (expensive) data analytics t...
 
Data sharing and analysis
Data sharing and analysisData sharing and analysis
Data sharing and analysis
 
La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...La statistique et le machine learning pour l'intégration de données de la bio...
La statistique et le machine learning pour l'intégration de données de la bio...
 
COMBINE standards & tools: Getting model management right
COMBINE standards & tools: Getting model management rightCOMBINE standards & tools: Getting model management right
COMBINE standards & tools: Getting model management right
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
 
Learning Biologically Relevant Features Using Convolutional Neural Networks f...
Learning Biologically Relevant Features Using Convolutional Neural Networks f...Learning Biologically Relevant Features Using Convolutional Neural Networks f...
Learning Biologically Relevant Features Using Convolutional Neural Networks f...
 
De los rasgos poligénicos a los poligenómicos 250517
De los rasgos poligénicos a los poligenómicos 250517De los rasgos poligénicos a los poligenómicos 250517
De los rasgos poligénicos a los poligenómicos 250517
 
Prediction and Meta-Analysis
Prediction and Meta-AnalysisPrediction and Meta-Analysis
Prediction and Meta-Analysis
 
Prediction and Meta-Analysis
Prediction and Meta-AnalysisPrediction and Meta-Analysis
Prediction and Meta-Analysis
 
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
Curses, tradeoffs, and scalable management: advancing evolutionary direct pol...
 
Software Pipelines: The Good, The Bad and The Ugly
Software Pipelines: The Good, The Bad and The UglySoftware Pipelines: The Good, The Bad and The Ugly
Software Pipelines: The Good, The Bad and The Ugly
 

ReComp: optimising the re-execution of analytics pipelines in response to changes in the data

  • 1. Paolo Missier, School of Computing, Newcastle University, UK. 1st CMLS workshop @ER 2020, November 4, 2020. ReComp: optimising the re-execution of analytics pipelines in response to changes in the data.
  • 2. Topics: data evolution and its impact on data-driven processes; the ProvONE provenance model and its role in selective process re-run; ReComp outlook: from source queries to analytics outcomes.
  • 3. Context. [Figure labels: Big Data; The Big Analytics Machine; Actionable Knowledge; Analytics; Data Science; over time t: V1, V2, V3; Meta-knowledge: Algorithms, Tools, Libraries, Reference datasets.]
  • 4. Reacting to changes in inputs: (1) always refresh; (2) approximate. [Figure labels: x1, x2, y1, d1, d2, f().]
  • 6. Case study: genomics. Variant calling / interpretation. Spark GATK tools on Azure: 45 mins / GB @ 13GB / exome: about 10 hours. Image credits: Broad Institute https://software.broadinstitute.org/gatk/ and https://www.genomicsengland.co.uk/the-100000-genomes-project/
  • 7. Genomics: WES / WGS, Variant calling → Variant interpretation. SVI: Simple Variant Interpretation. Variant classification: pathogenic, benign and unknown/uncertain. Reference: Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. In Procs. 11th International conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer.
  • 8. Reacting to changes in inputs. Refresh heuristics: (1) changes in patient’s phenotype → refresh; (2) changes in patient’s variants: new variants appear → run SVI on those new variants only; variants removed (rare) → remove those from previous SVI output. The function computed by the workflow is not stable! Small input perturbation → possibly large impact.
  • 9. Changes in data dependencies. <top-k annotated variants> = f(varset, phenotype, ClinVar, GeneMap). ClinVar: monthly updates. [Figure: evolution in the number of variants that affect patients (a) with a specific phenotype, (b) across all phenotypes.]
  • 10. Changes in variants: ViruSurf. Frequency of single mutations in sequences collected from the population, taken at different time points; frequency of variants with high transmissibility.
      Sequences collected ≤ 31/03/2020: ViruSurf 6,592 with D614G, 4,664 without (58.56% with D614G); ViruSurf-GISAID 15,034 with, 8,821 without (63.02%).
      Sequences collected ≥ 01/04/2020: ViruSurf 23,649 with D614G, 3,331 without (87.65%); ViruSurf-GISAID 18,421 with, 3,369 without (84.54%).
      Total @ August: 61.59% (≤ 31/03/2020), 86.26% (≥ 01/04/2020). Total @ October: 42.38% (≤ 31/03/2020), 90.00% (≥ 01/04/2020).
      Numbers change because: i) new knowledge arises from literature, ii) variants are re-computed by eliminating low quality sequences/sub-sequences, iii) more sequences are submitted from laboratories…
      References: Canakoglu A., Pinoli P., Bernasconi A., Alfonsi T., Melidis D.P., Ceri S. ViruSurf: an integrated database to investigate viral sequences. Nucleic Acids Research, 2020. https://doi.org/10.1093/nar/gkaa846. Bernasconi A., Canakoglu A., Pinoli P., Ceri S. Empowering Virus Sequence Research through Conceptual Modeling. In Proceedings of the 39th International Conference on Conceptual Modeling (ER 2020).
  • 11. Diff functions for SVI: ClinVar. Relational data → simple set difference. The ClinVar dataset: 30 columns. Changes from ClinVar 1/2016 to ClinVar 1/2017: records 349,074 → 543,841; added 200,746; removed 5,979; updated 27,662. [Figure labels: ClinVar 1/2016; ClinVar 1/2017; diff; (unchanged).]
  • 12. Reacting to changes in data dependencies. Impact: “Any variant with status moving from/to Red causes High impact on any patient who is affected by the variant”. Observation: variants v within output set y that are in scope for patient X remain in scope (monotonicity). Changes: (1) variant v changes status (unknown → benign, unknown → deleterious); (2) brand new variant. If in scope: compare status before / after (inexpensive) → recompute SVI on all inputs (expensive). Scope: which cases are affected? “A change in variant v can only have impact on a case X if v and X share the same phenotype”.
  • 13. SVI is unstable. Evolving knowledge about gene variations: should we care about updates? Sparsity issue: about 500 executions; 33 patients; total runtime about 60 hours; only 14 relevant output changes detected; 4.2 hours of computation per change.
  • 14. Empirical evaluation. Re-executions: 495 → 71. Ideal: 14. But: no false negatives.
  • 15. General problem formulation. d: current state of data source D. Problem: for each x in X, estimate impact without computing f(x,d) (assuming f is not stable). [Equations on the slide, not captured in this transcript: changes in d; scope of impact; impact due to changes in d; impact estimation.]
  • 16. Diff and impact functions for white box process: data-specific; process-specific. [Figure labels: omim; clinvar; overall impact; impact on the SVI output; impact on ‘p1 select genes’.]
  • 17. White box ReComp and Process Provenance. White box process structure → detailed history of each execution → more granular optimisations become possible. [Figure labels: Change in ClinVar; Change in GeneMap.]
  • 19. Prov + process structure = ProvONE. [Figure: ProvONE UML class diagram. Classes: User, Execution, Controller, Program, Workflow, Channel, Port, «Entity», «Collection», «Association», «Usage», «Generation». Relations: wasPartOf, «hadMember», «wasDerivedFrom», hasSubProgram, «hadPlan», controlledBy, controls, hasInPort, hasOutPort, connectsTo, «wasAssociatedWith», «agent», «wasInformedBy», «wasGeneratedBy», «used», «qualifiedGeneration», «qualifiedUsage», «qualifiedAssociation», hadEntity, hadInPort, hadOutPort, hasDefaultParam.]
  • 20. Workflow Provenance: example. ProvONE combines process structure (“plan”) with runtime provenance (“plan execution”).
      Plan: wf isa Workflow; hasSubProgram(wf, B1); hasSubProgram(wf, B2); hasSubProgram(wf, B3); hasOutPort(B3,O1); hasInPort(B3,I2); connectsTo(O2, c); connectsTo(I2, c); …
      Plan execution: wasAssociatedWith(wfexec, wf); wasAssociatedWith(B1exec, B1); wasAssociatedWith(B3exec, B3); wasPartOf(B1exec, wfexec); wasPartOf(B3exec, wfexec); wasGeneratedBy(g,D,B1exec); used(u,B3exec,D); hadInPort(u, I2); hadOutPort(g, O2).
      Reading: B1exec is an instance of program B1 and B3exec is an instance of program B3, both of which are part of Workflow wf. B3 has input port I2; B1 has output port O2, and these are connected (through channel c). Data item D was generated by B1exec on port O2 and used by B3exec on port I2.
  • 21. ProvONE enables selective re-execution. ProvONE traces the execution of nested workflows. This enables selective re-computation of the fragments affected by a change.
  • 22. Summary: the ReComp meta-process. [Figure labels, in reading order: History DB; Record execution history: Analytics Process P, Log / provenance; Detect and quantify changes: data diff(d,d’), Change Events; Changes: reference datasets, inputs; For each past instance, estimate impact of changes: Impact(d→d’, o), impact estimation functions; Scope: select relevant sub-processes; Optimisation: partially re-exec P, P(D), P(D’).]
  • 23. End-to-end ReComp. When should I rerun a query? Which data changes should I react to? [Figure labels: Observe; Control; ReComp; integration; Queries.] The GenoMetric Query Language (*). (*) Masseroli M, Canakoglu A, Pinoli P, Kaitoua A, Gulino A, Horlova O, Nanni L, Bernasconi A, Perna S, Stamoulakatou E, Ceri S. Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data. Bioinformatics, 2018. https://doi.org/10.1093/bioinformatics/bty688
  • 24. The GenoMetric Query Language. A query processing environment for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Reference: Masseroli M, Canakoglu A, Pinoli P, Kaitoua A, Gulino A, Horlova O, Nanni L, Bernasconi A, Perna S, Stamoulakatou E, Ceri S. Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data. Bioinformatics, 2018. https://doi.org/10.1093/bioinformatics/bty688
  • 25. ReComp for GenoMetric Query Language 1. Query: for kidney cancers, find mutations and their number in each exon. Different somatic mutation datasets are available at different time points [dataset label on slide: _2017_11], and different annotation files are released periodically. Recomputation of the query may only be required if we can estimate that more than 10% of samples would obtain a changed count.
  • 26. ReComp for GenoMetric Query Language 3. Query: find distal bindings in transcription regulatory regions. Find all enriched regions (peaks) of the CTCF transcription factor (TF) in ENCODE ChIP-seq narrow peak samples from the GM12878 lymphoblastoid human cell line which are the nearest regions farther than 100 kb from a transcription start site (TSS). For the same cell line, find also all peaks of the H3K4me1 histone modification (HM) which are also the nearest regions farther than 100 kb from a TSS. Then, out of the TF and HM peaks found in the same cell line, return all TF peaks that overlap with at least one HM peak and a known enhancer (EN) region. Different Transcription Factors datasets are available at different time points, and different annotation files are released periodically. Recomputation of the query may only be required if we can estimate a considerable change.
  • 27. Summary and selected references. ReComp is a customizable framework to enable optimisations of data-intensive processes. It reacts to changes in data by trying to predict their impact on process outcomes. When provenance is available, it can be used to optimize process re-run. Prototype implementation exists. Extensions: controlling refresh of query results; queries on data source establish which data changes we care about.
      References:
      - P. Missier and J. Cala, “Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes: Computational Framework, Reference Architecture, and Applications,” in Procs. IEEE Big Data Congress, 2019.
      - J. Cała and P. Missier, “Selective and Recurring Re-computation of Big Data Analytics Tasks: Insights from a Genomics Case Study,” Big Data Res., vol. 13, pp. 76–94, Sep. 2018.
      - J. Cala and P. Missier, “Provenance Annotation and Analysis to Support Process Re-Computation,” in Procs. IPAW 2018, 2018.
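
The scope and impact rules quoted in slide 12 can be read as two cheap predicates that are evaluated before deciding whether to re-run SVI for a past case. The following Python sketch illustrates that reading only; it is not the ReComp prototype. The record shapes `VariantChange` and `Case`, the status label "red", and the interpretation of "affected by the variant" as membership in the patient's variant set are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class VariantChange:
    """One entry of a reference-data diff (hypothetical shape)."""
    variant_id: str
    phenotype: str
    old_status: Optional[str]  # None for a brand new variant
    new_status: Optional[str]  # None for a removed variant


@dataclass(frozen=True)
class Case:
    """A past execution for one patient (hypothetical shape)."""
    patient_id: str
    phenotype: str
    variants: FrozenSet[str]


def in_scope(change: VariantChange, case: Case) -> bool:
    # Scope rule: a change can only matter if variant and case share a phenotype.
    return change.phenotype == case.phenotype


def impact(change: VariantChange, case: Case) -> str:
    # Impact rule: a status moving from or to "red" is High for affected patients.
    affected = change.variant_id in case.variants
    touches_red = change.old_status != change.new_status and "red" in (
        change.old_status, change.new_status)
    return "High" if affected and touches_red else "None"


def cases_to_recompute(changes: List[VariantChange], cases: List[Case]) -> List[str]:
    # Only cases that are in scope AND highly impacted are selected for re-execution.
    selected = {
        case.patient_id
        for case in cases
        for change in changes
        if in_scope(change, case) and impact(change, case) == "High"
    }
    return sorted(selected)


# Made-up example data, for illustration only.
changes = [
    VariantChange("v1", "P1", "unknown", "red"),
    VariantChange("v2", "P1", "unknown", "green"),
    VariantChange("v3", "P2", "red", "unknown"),
]
cases = [
    Case("A", "P1", frozenset({"v1", "v2"})),
    Case("B", "P1", frozenset({"v2"})),
    Case("C", "P2", frozenset({"v3"})),
    Case("D", "P3", frozenset({"v1"})),
]
print(cases_to_recompute(changes, cases))  # ['A', 'C']
```

Case B is in scope but sees no red transition, and case D carries v1 but has a different phenotype, so neither is re-executed. In this sketch the impact function is binary, which matches the editor's note that impact functions are currently only binary.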

Editor's notes

  1. $f: X \rightarrow Y$, $\mathbf{y} = f(\mathbf{x})$, with $\mathbf{x} = [x_1 \dots x_n]$ and $\mathbf{y} = [y_1 \dots y_m]$; $\delta_Y(y,y') > \Delta_Y$. Always refresh: simply compute $\mathbf{y'} = f(\mathbf{x'})$; inefficient if computing $f(.)$ is expensive, and $y, y'$ turn out to be very similar to each other. Approximate: find a new function $f'(.)$ that approximates $f(.)$; return $f'(\mathbf{x'})$.
  2. Define a distance metric $\delta_Y$ on $Y$; try and estimate $\delta_Y(y,y')$ \emph{without explicitly computing} $y'$; if $\delta_Y(y,y') > \Delta_Y$ for a set threshold $\Delta_Y$, then compute $f(\mathbf{x'})$. This approach works well when: (1) distance metrics can be defined on both $X$ and $Y$: $\delta_X, \delta_Y$; (2) $f(.)$ is \emph{stable}: $\delta_X(\mathbf{x},\mathbf{x'}) < \epsilon_X \Rightarrow \delta_Y(\mathbf{y},\mathbf{y'}) < \epsilon_Y$.
  3. We are going to use this smaller process as a testbed Changes in the reference databases have an impact on the classification
  4. $\delta_4(\text{CV}^t, \text{CV}^{t'}) = \langle \delta_4^+, \delta_4^-, \delta_4^{\pm} \rangle$
  5. This is only a small selection of rows and a subset of columns. In total there were 30 columns, 349,074 rows in the old set, 543,841 rows in the new set, 200,746 added rows, 5,979 removed rows and 27,662 changed rows. As on the previous slide, you may want to highlight that the selection of key-columns and where-columns is very important. For example, using #AlleleID, Assembly and Chromosome as the key columns, we have entry #AlleleID 15091, which looks very similar in both the added (green) and removed (red) sets. The two rows differ, however, in the Chromosome column. Considering the where-columns, using only ClinicalSignificance returns the blue rows, which differ between versions only in that column. Changes in other columns (e.g. LastEvaluated) are not reported, which may have ramifications if such a difference is used to produce the new output.
  6. Let $v \in \mathit{diff}_{Y}(Y^t, Y^{t'})$. For any $X$: $\mathit{impact}_{P}(C,X) = \texttt{High}$ if $v.\texttt{status}$ changes as $* \rightarrow \texttt{red}$ or $\texttt{red} \rightarrow *$.
  7. Threats: will any of the changes invalidate prior findings? Opportunities: can the findings be improved over time? Can we do better in a generic way? We need to control re-computation on two dimensions: across a population, and within a single process.
  8. \delta
  9. Impact functions are currently only binary
  10. Shows Essential ProvONE fragment used by ReComp
  11. The framework is a meta-process… Changes can also occur to OS, libraries and other dependencies but these are out of scope
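
The role of key-columns and where-columns described in note 5 can be illustrated with a small keyed diff that returns the three components of note 4: added, removed and changed rows. This is a minimal sketch under stated assumptions, not the diff function used in the talk. The function name `keyed_diff` and all row values are hypothetical; only the column names and the AlleleID 15091 example come from the notes.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def keyed_diff(
    old_rows: List[Row],
    new_rows: List[Row],
    key_cols: Sequence[str],
    where_cols: Sequence[str],
) -> Tuple[List[Row], List[Row], List[Tuple[Row, Row]]]:
    """Return (added, removed, changed) between two versions of a table."""

    def index(rows: List[Row]) -> Dict[tuple, Row]:
        # Rows are identified by the values of their key columns.
        return {tuple(row[k] for k in key_cols): row for row in rows}

    old, new = index(old_rows), index(new_rows)
    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]
    # A row counts as changed only if it differs in one of the where-columns.
    changed = [
        (old[k], new[k])
        for k in old
        if k in new and any(old[k][c] != new[k][c] for c in where_cols)
    ]
    return added, removed, changed


# Made-up rows; column names follow note 5.
old_version = [
    {"#AlleleID": 15091, "Assembly": "GRCh37", "Chromosome": "1",
     "ClinicalSignificance": "Benign", "LastEvaluated": "2015"},
    {"#AlleleID": 2, "Assembly": "GRCh37", "Chromosome": "7",
     "ClinicalSignificance": "Uncertain significance", "LastEvaluated": "2015"},
    {"#AlleleID": 3, "Assembly": "GRCh37", "Chromosome": "9",
     "ClinicalSignificance": "Pathogenic", "LastEvaluated": "2015"},
]
new_version = [
    # Same AlleleID, different Chromosome: a different key, so added + removed.
    {"#AlleleID": 15091, "Assembly": "GRCh37", "Chromosome": "2",
     "ClinicalSignificance": "Benign", "LastEvaluated": "2015"},
    # ClinicalSignificance differs: reported as changed.
    {"#AlleleID": 2, "Assembly": "GRCh37", "Chromosome": "7",
     "ClinicalSignificance": "Pathogenic", "LastEvaluated": "2016"},
    # Only LastEvaluated differs: not reported with this choice of where-columns.
    {"#AlleleID": 3, "Assembly": "GRCh37", "Chromosome": "9",
     "ClinicalSignificance": "Pathogenic", "LastEvaluated": "2016"},
]

added, removed, changed = keyed_diff(
    old_version, new_version,
    key_cols=["#AlleleID", "Assembly", "Chromosome"],
    where_cols=["ClinicalSignificance"],
)
print(len(added), len(removed), len(changed))  # 1 1 1
```

Adding LastEvaluated to `where_cols` would report the third row as changed too, which is the sensitivity to column selection that the note warns about.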