Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Paolo Missier
School of Computing
Newcastle University, UK
1st CMLS workshop @ER 2020
November 4, 2020
ReComp: optimising ...
2
Topics
Data evolution and its impact on data-driven processes
The ProvONE provenance model and role in selective process...
3
Context
Big
Data
The Big
Analytics
Machine
Actionable
Knowledge
Analytics
Data Science over time V3
V2
V1
Meta-knowledge...
4
CMLSworkshop2020–P.Missier
Reacting to changes in inputs
1. Always refresh
2. Approximate
x1
x2
y1
d1 d2
f()
5
A refresh-if-needed approachCMLSworkshop2020–P.Missier
6
Case study: genomics
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
https://www.genomicsenglan...
7
Genomics: WES / WGS, Variant calling  Variant interpretation
SVI: a simple single-nucleotide Human Variant Interpretati...
8
Reacting to changes in inputs
Refresh heuristics:
1. Changes in patient’s phenotype  Refresh
2. Changes in patient’s va...
9
Changes in data dependencies
<top-k annotated variants> = f(varset, phenotype, ClinVar, GeneMap)
Evolution in number of ...
<eventname>
Changes in variants: ViruSurf
Frequence of single mutations in sequences collected from the population, taken ...
11
Diff functions for SVI: ClinVar
ClinVar 1/2016 ClinVar 1/2017
diff
(unchanged)
Relational data  simple set difference
...
13
Reacting to changes in data dependencies
Impact: “Any variant with status moving from/to Red causes High impact on any
...
14
SVI is unstable
Sparsity issue:
• About 500 executions
• 33 patients
• total runtime about 60 hours
• Only 14 relevant ...
15
Empirical evaluation
re-executions 495  71 Ideal: 14
But: no false negatives
CMLSworkshop2020–P.Missier
16
General Problem formulation
d: current state of data source D
Changes in d:
Problem: for each x in X
estimate impact wi...
17
Diff and impact functions for white box process
- Data-specific
- Process-specificomim
clinvar
Overall
impact
impact on...
18
White box ReComp and Process Provenance
White box process structure 
detailed history of each execution 
more granula...
19
PROV (2013)CMLSworkshop2020–P.Missier
https://www.w3.org/TR/prov-dm/
20
Prov + process structure = ProvONE
User Execution
«Association » «Usage» «Generation »
«Entity»
«Collection»
Controller...
21
Workflow Provenance: example
ProvONE combines process structure with runtime provenance
“plan”
“plan
execution”
WF
B1
B...
22
ProvONE enables selective re-execution
ProvONE traces the execution of nested workflows
Enables selective re-computatio...
23
Summary: the ReComp meta-process
History
DB
Detect and
quantify
changes
data diff(d,d’)
Record execution
history
Analyt...
24
End-to-end ReComp
ObserveControl
ReComp
CMLSworkshop2020–P.Missier
- When should I rerun a query?
- Which data changes ...
<eventname>
The GenoMetric Query Language
Masseroli M, Canakoglu A, Pinoli P, Kaitoua A, Gulino A, Horlova O, Nanni L, Ber...
<eventname>
ReComp for GenoMetric Query Language 1
For kidney cancers,
find mutations and
their number in each exon
_2017_...
<eventname>
ReComp for GenoMetric Query Language 3
Find distal bindings in transcription regulatory regions: Find all enri...
28
Summary and selected references
ReComp is a customizable framework to enable optimisations of data-intensive processes
...
Prochain SlideShare
Chargement dans…5
×

ReComp: optimising the re-execution of analytics pipelines in response to changes in the data

53 vues

Publié le

A keynote talk given at at the 1st CMLS workshop @ER 2020

Publié dans : Technologie
  • Soyez le premier à commenter

  • Soyez le premier à aimer ceci

ReComp: optimising the re-execution of analytics pipelines in response to changes in the data

  1. 1. Paolo Missier School of Computing Newcastle University, UK 1st CMLS workshop @ER 2020 November 4, 2020 ReComp: optimising the re-execution of analytics pipelines in response to changes in the data
  2. 2. 2 Topics Data evolution and its impact on data-driven processes The ProvONE provenance model and role in selective process re-run ReComp outlook: from source queries to analytics outcomes CMLSworkshop2020–P.Missier
  3. 3. 3 Context Big Data The Big Analytics Machine Actionable Knowledge Analytics Data Science over time V3 V2 V1 Meta-knowledge Algorithms Tools Libraries Reference datasets t t t CMLSworkshop2020–P.Missier
  4. 4. 4 CMLSworkshop2020–P.Missier Reacting to changes in inputs 1. Always refresh 2. Approximate x1 x2 y1 d1 d2 f()
  5. 5. 5 A refresh-if-needed approachCMLSworkshop2020–P.Missier
  6. 6. 6 Case study: genomics Image credits: Broad Institute https://software.broadinstitute.org/gatk/ https://www.genomicsengland.co.uk/the-100000-genomes-project/ Spark GATK tools on Azure: 45 mins / GB @ 13GB / exome: about 10 hours Variant calling / interpretation CMLSworkshop2020–P.Missier
  7. 7. 7 Genomics: WES / WGS, Variant calling  Variant interpretation SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer SVI: Simple Variant Interpretation Variant classification : pathogenic, benign and unknown/uncertain CMLSworkshop2020–P.Missier
  8. 8. 8 Reacting to changes in inputs Refresh heuristics: 1. Changes in patient’s phenotype  Refresh 2. Changes in patient’s variants: - New variants appear  run SVI on those new variants only - Variants removed (rare)  remove those from previous SVI output CMLSworkshop2020–P.Missier The function computed by the workflow is not stable! Small input perturbation  possibly large impact
  9. 9. 9 Changes in data dependencies <top-k annotated variants> = f(varset, phenotype, ClinVar, GeneMap) Evolution in number of variants that affect patients (a) with a specific phenotype (b) Across all phenotypes ClinVar - Monthly updates CMLSworkshop2020–P.Missier
  10. 10. <eventname> Changes in variants: ViruSurf Frequence of single mutations in sequences collected from the population, taken at different time points ViruSurf ViruSurf-GISAID ViruSurf ViruSurf-GISAID ≤ 31/03/2020 ≥ 01/04/2020 With D614G 6,592 15034 23,649 18,421 Without D614G 4,664 8821 3,331 3369 D614% 58.56% 63.02% 87.65% 84.54% Total @ AUGUST 61.59% 86.26% Total @ OCTOBER 42.38% 90.00% Frequence of variants with high transmissibility Numbers change because: i) new knowledge arises from literature, ii) variants are re-computed by eliminating low quality sequences/sub-sequences, iii) more sequences are submitted from laboratories… Canakoglu A., Pinoli P., Bernasconi A., Alfonsi T., Melidis D.P., Ceri S. ViruSurf: an integrated database to investigate viral sequences. Nucleic Acids Research, 2020. https://doi.org/10.1093/nar/gkaa846 Bernasconi A., Canakoglu A., Pinoli P., Ceri S. Empowering Virus Sequence Research through Conceptual Modeling. In Proceedings of the 39th International Conference on Conceptual Modeling (ER 2020).
  11. 11. 11 Diff functions for SVI: ClinVar ClinVar 1/2016 ClinVar 1/2017 diff (unchanged) Relational data  simple set difference The ClinVar dataset: 30 columns Changes: Records: 349,074  543,841 Added 200,746 Removed 5,979. Updated 27,662 CMLSworkshop2020–P.Missier
  12. 12. 13 Reacting to changes in data dependencies Impact: “Any variant with status moving from/to Red causes High impact on any patient who is affected by the variant” Observation: Variants v within output set y that are in scope for patient X remain in scope! (monotonicity) 1. Variant v changes status - unknown  benign - unknown  deleterious 2. Brand new variant  If in scope: compare status before / after inexpensive  recompute SVI on all inputs expensive Scope: which cases are affected? “a change in variant v can only have impact on a case X if V and X share the same phenotype” CMLSworkshop2020–P.Missier
  13. 13. 14 SVI is unstable Sparsity issue: • About 500 executions • 33 patients • total runtime about 60 hours • Only 14 relevant output changes detected 4.2 hours of computation per change Should we care about updates? Evolving knowledge about gene variations CMLSworkshop2020–P.Missier
  14. 14. 15 Empirical evaluation re-executions 495  71 Ideal: 14 But: no false negatives CMLSworkshop2020–P.Missier
  15. 15. 16 General Problem formulation d: current state of data source D Changes in d: Problem: for each x in X estimate impact without computing f(x,d) (assuming f is not stable) Scope of impact: CMLSworkshop2020–P.Missier Impact due to changes in d: Impact estimation:
  16. 16. 17 Diff and impact functions for white box process - Data-specific - Process-specificomim clinvar Overall impact impact on the SVI output impact on ‘p1 select genes’ CMLSworkshop2020–P.Missier
  17. 17. 18 White box ReComp and Process Provenance White box process structure  detailed history of each execution  more granular optimisations become possible Change in ClinVar Change in GeneMap CMLSworkshop2020–P.Missier
  18. 18. 19 PROV (2013)CMLSworkshop2020–P.Missier https://www.w3.org/TR/prov-dm/
  19. 19. 20 Prov + process structure = ProvONE User Execution «Association » «Usage» «Generation » «Entity» «Collection» Controller Program Workflow Channel Port wasPartOf «hadMember » «wasDerivedFrom » hasSubProgram «hadPlan » controlledBy controls[*] [*] [*] [*] [*] [*] «wasDerivedFrom » [*][*] [0..1] [0..1] [0..1] [*][1] [*] [*] [0..1] [0..1] hasOutPort [*][0..1] [1] «wasAssociatedWith » «agent » [1] [0..1] [*] [*] [*] [*] [*] [*] [*] [*] [*] [*] [*] [*] [0..1] [0..1] hasInPort [*][0..1] connectsTo [*] [0..1] «wasInformedBy » [*][1] «wasGeneratedBy » «qualifiedGeneration » «qualifiedUsage » «qualifiedAssociation » hadEntity «used » hadOutPorthadInPort [*][1] [1] [1] [1] [1] hadEntity hasDefaultParam CMLSworkshop2020–P.Missier
  20. 20. 21 Workflow Provenance: example ProvONE combines process structure with runtime provenance “plan” “plan execution” WF B1 B2 B1exec B3exec D WFexec wasPartOf wasPartOf usedwasGeneratedBy wasAssociatedWith dataSource usage ProgramWorkflow Execution Entity (ref data) O1 B3 hasSubProgram O2 I1 I2 wasAssociatedWith(wfexec, wf) wasAssociatedWith(B1exec, B1) wasAssociatedWith(B3exec, B3) wasPartOf(B1exec, wfexec) wasPartOf(B3exec, wfexec) wasGeneratedBy(g,D,B1exec) used(u,B3exec,D) hadInPort(u, I2) hadOutPort(g, O2) wf isa Workflow hasSubProgram(wf, B1) hasSubProgram(wf, B2) hasSubProgram(wf, B3) hasOutPort(B3,O1) hasInPort(B3,I2) connectsTo(O2, c) connectsTo(I2, c) … wasAssociatedWith wasAssociatedWith B1exec is an instance of program B1, B3exec is an instance of program B3, which are part of Workflow wf B3 has input port I2; B1 has output port O2, and these are connected (through channel c) Data item D was generated by B1exec on port O2 and used by B3exec on port I2 CMLSworkshop2020–P.Missier
  21. 21. 22 ProvONE enables selective re-execution ProvONE traces the execution of nested workflows Enables selective re-computation of the fragments affected by a change CMLSworkshop2020–P.Missier
  22. 22. 23 Summary: the ReComp meta-process History DB Detect and quantify changes data diff(d,d’) Record execution history Analytics Process P Log / provenance Partially Re-exec P (D) P(D’) Change Events Changes: • Reference datasets • Inputs For each past instances: Estimate impact of changes Impact(dd’, o) impact estimation functions Scope Select relevant sub-processes Optimisation CMLSworkshop2020–P.Missier
  23. 23. 24 End-to-end ReComp ObserveControl ReComp CMLSworkshop2020–P.Missier - When should I rerun a query? - Which data changes should I react to? integration Queries The Genometric query language. (*) (*) Masseroli M, Canakoglu A, Pinoli P, Kaitoua A, Gulino A, Horlova O, Nanni L, Bernasconi A, Perna S, Stamoulakatou E, Ceri S. Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data. Bioinformatics, 2018. https://doi.org/10.1093/bioinformatics/bty688
  24. 24. <eventname> The GenoMetric Query Language Masseroli M, Canakoglu A, Pinoli P, Kaitoua A, Gulino A, Horlova O, Nanni L, Bernasconi A, Perna S, Stamoulakatou E, Ceri S. Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data. Bioinformatics, 2018. https://doi.org/10.1093/bioinformatics/bty688 A query processing environment for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets
  25. 25. <eventname> ReComp for GenoMetric Query Language 1 For kidney cancers, find mutations and their number in each exon _2017_11 Different somatic mutation datasets at different time points Different annotation files are released periodically Recomputation of the query may only be required if we can estimate that > 10% samples would obtain a changed count
  26. 26. <eventname> ReComp for GenoMetric Query Language 3 Find distal bindings in transcription regulatory regions: Find all enriched regions (peaks) of CTCF transcription factor (TF) in ENCODE ChIP-seq narrow peak samples from GM12878 lymphoblastoid human cell line which are the nearest regions farther than 100 kb from a transcription start site (TSS). For the same cell line, find also all peaks of the H3K4me1 histone modification (HM) which are also the nearest regions farther than 100 kb from a TSS. Then, out of the TF and HM peaks found in the same cell line, return all TF peaks that overlap with at least a HM peak and known enhancer (EN) region. Different Transcription Factors datasets at different time points Different annotation files are released periodically Recomputation of the query may only be required if we can estimate a considerable change
  27. 27. 28 Summary and selected references ReComp is a customizable framework to enable optimisations of data-intensive processes It reacts to changes in data by trying to predict their impact on process outcomes When provenance is available, it can be used to optimize process re-run Prototype implementation exists Extensions: - controlling refresh of query results - queries on data source establish which data changes we care about - P. Missier and J. Cala, “Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes: Computational Framework, Reference Architecture, and Applications,” in Procs. IEEE Big Data Congress, 2019. - J. Cała and P. Missier, “Selective and Recurring Re-computation of Big Data Analytics Tasks: Insights from a Genomics Case Study,” Big Data Res., vol. 13, pp. 76–94, Sep. 2018. - J. Cala and P. Missier, “Provenance Annotation and Analysis to Support Process Re-Computation,” in Procs. IPAW 2018, 2018. CMLSworkshop2020–P.Missier

×