Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.

Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes

228 vues

Publié le

An invited talk given at the IEEE Big Data Congress 2019 in Milan, Italy

Publié dans : Technologie
  • Identifiez-vous pour voir les commentaires

  • Soyez le premier à aimer ceci

Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes

  1. 1. Paolo Missier and Jacek Cala Newcastle University, UK IEEE Big Data Congress Milan, Italy July 8th, 2019 Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes In collaboration with • Institute of Genetic Medicine, Newcastle University • School of GeoSciences, Newcastle University
  2. 2. 2 Context Big Data The Big Analytics Machine Actionable Knowledge Analytics Data Science over time V3 V2 V1 Meta-knowledge Algorithms Tools Libraries Reference datasets t t t
  3. 3. 3 What changes? • Genomics • Reference databases • Algorithms and libraries • Simulation • Large parameter space • Input conditions • Machine Learning • Evolving ground truth datasets • Model re-training
  4. 4. 4 Genomics Image credits: Broad Institute https://software.broadinstitute.org/gatk/ https://www.genomicsengland.co.uk/the-100000-genomes-project/ Spark GATK tools on Azure: 45 mins / GB @ 13GB / exome: about 10 hours
  5. 5. 5 Genomics: WES / WGS, Variant calling  Variant interpretation SVI: a simple single-nucleotide Human Variant Interpretation tool for Clinical Use. Missier, P.; Wijaya, E.; Kirby, R.; and Keogh, M. In Procs. 11th International conference on Data Integration in the Life Sciences, Los Angeles, CA, 2015. Springer SVI: Simple Variant Interpretation Variant classification : pathogenic, benign and unknown/uncertain
  6. 6. 6 Blind reaction to change: a game of battleship Sparsity issue: • About 500 executions • 33 patients • total runtime about 60 hours • Only 14 relevant output changes detected 4.2 hours of computation per change Should we care about updates? Evolving knowledge about gene variations
  7. 7. 7 ReComp http://recomp.org.uk/ Outcome: A framework for selective Re-computation • Generic, Customisable Scope: expensive analysis + frequent changes + not all changes significant Challenge: Make re-computation efficient in response to changes Assumptions: Processes are • Observable • Reproducible • Estimates are cheap Insight: replace re-computation with change impact estimation Using history of past executions
  8. 8. 8 Reproducibility How Selective: - Across a cohort of past executions.  which subset of individuals? - Within a single re-execution  which process fragments? Change in ClinVar Change in GeneMap  Why, when, to what extent
  9. 9. 9 The rest of the talk • Approach • Architecture • Evaluation (case study)
  10. 10. 10 The ReComp meta-process History DB Detect and quantify changes data diff(d,d’) Record execution history Analytics Process P Log / provenance Partially Re-exec P (D) P(D’) Change Events Changes: • Reference datasets • Inputs For each past instances: Estimate impact of changes Impact(dd’, o) impact estimation functions Scope Select relevant sub-processes Optimisation
  11. 11. 11 How much do we know about P? Impact estimation Re-execution less more Process structure Execution trace black box I/O provenance I/O only All-or-nothing monolithic process, legacy  a complex simulator white box step-by-step provenance workflows, R / python code  genomics analyticsTypical process Fine-grained Impact Partial  restart trees (*) (*) Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018. London: Springer; 2018.
  12. 12. 12 SVI: data-diff and impact functions - Data-specific - Process-specificomim clinvar Overall impact
  13. 13. 13 Diff functions for SVI ClinVar 1/2016 ClinVar 1/2017 diff (unchanged) Relational data  simple set difference
  14. 14. 14 Example impact functions: SVI Returns True iff: - Known variants have moved in/out of Red status - New Red variants have appeared - Known Red variants have been retracted
  15. 15. 15 ReComp decision matrix for SVI Impact: yes / no / not assessed delta functions: data diff detected?
  16. 16. 16 Empirical validation PaoloMissier2019 IEEEBigDataCongress re-executions 495  71 Ideal: 14 But: no false negatives
  17. 17. 17 SVI implemented using workflow Phenotype to genes Variant selection Variant classification Patient variants GeneMap ClinVar Classified variants Phenotype
  18. 18. 18 Execution trace / Provenance User Execution «Association » «Usage» «Generation » «Entity» «Collection» Controller Program Workflow Channel Port wasPartOf «hadMember » «wasDerivedFrom » hasSubProgram «hadPlan » controlledBy controls[*] [*] [*] [*] [*] [*] «wasDerivedFrom » [*][*] [0..1] [0..1] [0..1] [*][1] [*] [*] [0..1] [0..1] hasOutPort [*][0..1] [1] «wasAssociatedWith » «agent » [1] [0..1] [*] [*] [*] [*] [*] [*] [*] [*] [*] [*] [*] [*] [0..1] [0..1] hasInPort [*][0..1] connectsTo [*] [0..1] «wasInformedBy » [*][1] «wasGeneratedBy » «qualifiedGeneration » «qualifiedUsage » «qualifiedAssociation » hadEntity «used » hadOutPorthadInPort [*][1] [1] [1] [1] [1] hadEntity hasDefaultParam
  19. 19. 19 SVI – restart trees Overhead: caching intermediate data Time savings Partial re-exec (sec) Complete re-exec Time saving (%) GeneMap 325 455 28.5 ClinVar 287 455 37 Change in ClinVar Change in GeneMap Cala J, Missier P. Provenance Annotation and Analysis to Support Process Re-Computation. In: Procs. IPAW 2018.
  20. 20. 20 Architecture <eventname> ReComp Core HDB «ProvONE store» Tabular-Diff Service Tabular-Diff Service Difference Function ReExecution Service A ReExecution Service A ReExecution Function Impact Service B Impact Service B Impact Function ReComp Loop User Process Runtime Environment Inputs Outputs Interface Di f f Se r vi c e Interface I mpa c t Se r vi c e Interface Re Exe c Se r vi c e Process and data provenance Prolog facts store/retrieve REST API External services REST API Executes restart trees - React to change events - Construct restart trees
  21. 21. 21 Customising ReComp in practice <eventname> Enable provenance capture / Map to PROV
  22. 22. 22 Summary <eventname> Evaluation: case-by-case basis - Cost savings - Ease of customisation Generic framework Fine-grained provenance + control  max savings Tested on two cases studies - Genomics - Simulation (flood modelling)  see paper

×