ReComp:
Preserving the value of large scale data analytics over time
through selective re-computation
recomp.org.uk
Paolo Missier, Jacek Cala, Manisha Rathi
School of Computing Science
Newcastle University
Keele University, Dec. 2016
(*) Painting by Johannes Moreelse
Panta Rhei (Heraclitus)
2
Data Science
[Diagram: Big Data feeds "The Big Analytics Machine", producing "Valuable Knowledge"; the machine rests on meta-knowledge: algorithms, tools, middleware, reference datasets]
3
Data Science over time
[Diagram: the same picture over time t: Big Data, the meta-knowledge (algorithms, tools, middleware, reference datasets) and the resulting "Valuable Knowledge" all evolve through versions V1, V2, V3]
4
Example: supervised learning
[Diagram: a training set and background knowledge (prior) feed model learning, using classification algorithms (meta-knowledge), to produce a predictive classifier]
If the training set is no longer representative of current data, the model loses predictive power.
Ex.: the training set is a sample from a social media stream (Twitter, Instagram, …)
• Incremental training: established (neural networks, Bayes classifiers, …); see the sketch after the references below
• Incremental unlearning: some established work [1]
[1] Kidera, Takuya, Seiichi Ozawa, Shigeo Abe. “An Incremental Learning Algorithm of Ensemble Classifier Systems.” Neural
Networks, 2006, 6453–59. doi:10.1109/IJCNN.2006.247345.
[2] Polikar, R., L. Upda, S.S. Upda, and V. Honavar. “Learn++: An Incremental Learning Algorithm for Supervised Neural
Networks.” IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31, no. 4 (2001): 497–
508. doi:10.1109/5326.983933.
[3] Diehl, C.P., and G. Cauwenberghs. “SVM Incremental Learning, Adaptation and Optimization.” Proceedings of the
International Joint Conference on Neural Networks, 2003. 4, no. x (2003): 2685–90. doi:10.1109/IJCNN.2003.1223991.
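Purely as an illustration of the "incremental training" bullet above (not from the deck), here is a minimal sketch using scikit-learn's partial_fit, which updates a linear classifier from successive mini-batches instead of retraining from scratch; the data and names are made up.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical example: a linear classifier updated incrementally,
# one mini-batch of streamed (features, label) pairs at a time.
rng = np.random.default_rng(0)
clf = SGDClassifier()                         # linear model trained by SGD
classes = np.array([0, 1])                    # must be declared up front for partial_fit

for batch in range(10):                       # e.g. 10 successive windows of a stream
    X = rng.normal(size=(100, 5))             # stand-in for streamed feature vectors
    y = (X[:, 0] + 0.1 * batch > 0).astype(int)   # a slowly drifting decision boundary
    clf.partial_fit(X, y, classes=classes)    # update the model, no full retrain

print("accuracy on the latest batch:", clf.score(X, y))
```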
5
Example: stream analytics
[Diagram: a data stream and background knowledge feed time series analysis, using pattern recognition algorithms (meta-knowledge), to detect temporal patterns, activity, user behaviour, …]
• If the output is stable over time, can I save computation and deliver older outcomes instead?
• How do I quantify the quality/cost trade-offs?
6
Analytics functions and their dependencies can be complex
Y = f(X, D)
X: inputs (vector of arbitrary data structures, "big data")
D: vector of dependencies: libraries, reference data
Y: outputs (vector of arbitrary data structures, "knowledge")
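To make the notation concrete, here is a small illustrative sketch (not part of the original slides) of the Y = f(X, D) view: an execution records its inputs, the versions of its dependencies, and its outputs, so that a later change to any dependency can be detected. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Dependency:
    name: str        # e.g. "ClinVar", "scikit-learn", "HG19"
    version: str     # the version in use when the outcome was produced

@dataclass
class Execution:
    inputs: Dict[str, Any]        # X: the "big data" inputs
    deps: Dict[str, Dependency]   # D: libraries, reference data, platform
    outputs: Dict[str, Any]       # Y: the "knowledge" outcomes

def run(f: Callable[[Dict[str, Any], Dict[str, Dependency]], Dict[str, Any]],
        inputs: Dict[str, Any], deps: Dict[str, Dependency]) -> Execution:
    """Record one execution Y = f(X, D) together with the versions it used."""
    return Execution(inputs, deps, f(inputs, deps))

def outdated(e: Execution, current: Dict[str, Dependency]) -> List[str]:
    """Dependencies whose version has changed since the recorded execution."""
    return [n for n, d in e.deps.items()
            if n in current and current[n].version != d.version]
```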
Ex.: machine learning, using Python and scikit-learn, to learn a model that recognises an activity pattern.
- Inputs: training + testing dataset, config
- Dependencies: scikit-learn, NumPy, Pandas, Python 3, Ubuntu x.y.z on an Azure VM
- Output: model (from model training)

Ex.: workflow to identify mutations in a patient's genome ("analyse input genome").
- Inputs: input genome, config, reference genome, variant DBs
- Dependencies: workflow specification, workflow manager (and its own dependencies), GATK/Picard/BWA, Linux VM cluster on Azure (Ubuntu)
- Output: variants
7
Complex NGS pipelines
[Pipeline diagram: per-sample lanes (raw sequences → align → clean → recalibrate alignments → calculate coverage → coverage information) feed a shared variant-calling stage (call variants → recalibrate variants), followed by per-sample filter variants → annotate → annotated variants; Stages 1–3]
Step annotations:
- Align: aligns the sample sequence to the HG19 reference genome using the BWA aligner
- Clean: cleaning, duplicate elimination (Picard tools)
- Recalibrate alignments: corrects for systematic bias in the quality scores assigned by the sequencer (GATK)
- Calculate coverage: computes the coverage of each read
- Call variants: variant calling operates on multiple samples simultaneously; samples are split into chunks; the haplotype caller detects both SNVs and longer indels
- Recalibrate variants: variant recalibration attempts to reduce the false-positive rate from the caller
- Filter variants: VCF subsetting by filtering, e.g. non-exomic variants
- Annotate: Annovar functional annotations (e.g. MAF, synonymy, SNPs, …), followed by in-house annotations
8
Problem size: HPC vs Cloud deployment
Configuration:
- HPC cluster (dedicated nodes): 3 x 8-core compute nodes (Intel Xeon E5640, 2.67 GHz CPU), 48 GiB RAM, 160 GB scratch space
- Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD
[Chart: response time [hh:mm], 00:00 to 72:00, vs number of samples (0–24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)]
Big Data:
• raw sequences for Whole Exome Sequencing (WES): 5–20GB per patient
• processed in cohorts of 20–40, i.e. close to 1 TB per cohort
• time required to process a 24-sample cohort can easily exceed 2 CPU months
• WES is about 2% of what the Whole Genome Sequencing analyses require
9
Understanding change: threats and opportunities
[Diagram: Big Data feeds Life Sciences Analytics, producing "Valuable Knowledge" versions V1, V2, V3 over time t; the meta-knowledge (algorithms, tools, middleware, reference datasets) also changes over time]
• Threats: Will any of the changes invalidate prior findings?
• Opportunities: Can the findings from the pipelines be improved over time?
• Cost: Need to model future costs based on past history and pricing trends for virtual appliances
• Impact analysis:
• Which patients/samples are likely to be affected?
• How do we estimate the potential benefits on affected patients?
• Can we estimate the impact of these changes without re-computing entire cohorts?
Changes:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard, …)
10
ReComp
The ReComp loop:
• Observe change: in big data, in meta-knowledge
• Assess and measure: knowledge decay
• Estimate: cost and benefits of refresh
• Enact: reproduce (analytics) processes
[Diagram: the loop sits between the evolving Big Data / Life Sciences Analytics / "Valuable Knowledge" (versions V1, V2, V3 over time t) and the evolving meta-knowledge (algorithms, tools, middleware, reference datasets)]
A decision support system for selectively re-computing complex analytics in reaction to
change
- Generic: not just for the life sciences
- Customisable: eg for genomics pipelines
11
Challenges
1. Observability: to what extent can we observe the process and its execution?
• Process structure
• Data flow → provenance
2. Detecting and quantifying changes:
• In inputs, dependencies, outputs → diff() functions
3. Control: how much control do we have on the system?
• Re-run: how often
• Total vs partial execution
• Input density / resolution / incremental update
• E.g. non-monotonic learning / unlearning
[Diagram: the ReComp Decision Support System takes change events, diff(.,.) functions, "business rules" and a history of past knowledge assets, and produces optimal re-computation prioritisation, impact and cost estimates, and reproducibility assessment]
12
General ReComp problem formulation
13
Change Impact
14
Example: NGS variant interpretation
Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
- E.g. 100K Genome Project, Genomics England, GeCIP
[Pipeline diagram: the same three-stage NGS variant-calling pipeline shown earlier (align → clean → recalibrate alignments → calculate coverage; call variants → recalibrate variants; filter variants → annotate → annotated variants)]
Variant interpretation can help to confirm/reject a hypothesis about the patient's phenotype.
It classifies variants into three categories: RED (pathogenic), GREEN (benign) and AMBER (unknown/uncertain).
Also: metagenomics: species identification, e.g. the EBI Metagenomics portal.
15
The SVI example
16
Change in variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
17
ReComp Problem Statement
1. Estimate the impact of changes
2. Optimise ReComp decisions: select the subset of the population that maximises expected impact, subject to a budget constraint
Problem: P is computationally expensive
18
Estimators: formalisation and a possible approach
Problem: f() is computationally expensive.
Approach: learn an approximation f'() of f(): a surrogate (emulator).
Sensitivity analysis: given changes to the inputs (and local changes to the dependencies), assess the resulting change in the output, modelled as y = f'(x) + ε, where ε is a stochastic term that accounts for the error in approximating f, and is typically assumed to be Gaussian.
Learning f'() requires a training set { (xi, yi) } …
If f'() can be found, then we can hope to use it to approximate the impact of a change, which can then be used to carry out sensitivity analysis.
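Below is a purely illustrative sketch of the surrogate idea, assuming a scikit-learn Gaussian process regressor as the emulator f'(); the features, data and interpretation are invented, and this is not the project's actual estimator.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical training set: each row describes a past change
# (e.g. number of added/removed reference records, input size, ...)
# and y is the observed impact of re-computing (e.g. change in the output).
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 1, size=(50, 3))                    # change features
y_train = X_train[:, 0] * 2 + rng.normal(0, 0.05, size=50)   # observed impact

surrogate = GaussianProcessRegressor().fit(X_train, y_train)  # f'() approximating f()

# For a new change, predict its impact (with uncertainty) without running f.
x_new = np.array([[0.8, 0.1, 0.3]])
mean, std = surrogate.predict(x_new, return_std=True)
print(f"estimated impact {mean[0]:.2f} +/- {std[0]:.2f}")
```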
19
Scope of change
1. Change: may affect a subset of the patient population → scope; which patients will be affected?
2. Change: affects a single patient → partial re-run
20
Challenge 1: battleships
Patient / change impact matrix
First challenge: precisely identify the scope of a change.
Blind reaction to change: recompute the entire matrix.
Can we do better?
- Hit the high-impact cases (the X) without re-computing the entire matrix
21
SVI process: detailed design
[Diagram: SVI pipeline: phenotype to genes → variant selection → variant classification; inputs: phenotype hypothesis, patient variants; reference data: GeneMap, ClinVar; output: classified variants]
22
Baseline: Blind recomputation
17 minutes / patient (single-core VM)
Runtime consistent across different phenotypes
Changes to GeneMap/ClinVar have negligible impact on the execution time

Run time [mm:ss] (μ ± σ) by GeneMap version:
  2016-03-08: 17:05 ± 22
  2016-04-28: 17:09 ± 15
  2016-06-07: 17:10 ± 17
23
Inside a single instance: Partial re-computation
[Diagram: the fragments of SVI that must be re-executed after a change in ClinVar vs after a change in GeneMap]
24
White-box granular provenance
[Diagram: provenance graph of process P relating inputs x11, x12 and dependencies D11, D12 to output y11]
- Using provenance metadata to identify fragments of SVI that
are affected by the change in reference data
26
Results
Run time [mm:ss] (μ ± σ) and savings, by GeneMap version:
  2016-04-28: 11:51 ± 16 (31% saving)
  2016-06-07: 11:50 ± 20 (31% saving)
Run time [mm:ss] (μ ± σ) and savings, by ClinVar version:
  2016-02: 9:51 ± 14 (43% saving)
  2016-05: 9:50 ± 15 (42% saving)
• How much can we save?
• Process structure
• First usage of reference data
• Overhead: storing interim data required in partial re-execution
• 20–22 MB for GeneMap changes and 2–334 kB for ClinVar changes
27
Partial re-computation using input difference
Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff:
Q(CV) → Q(diff(CV1, CV2))
This works for SVI, but is hard to generalise: it depends on the type of process.
Bigger gain: diff(CV1, CV2) is much smaller than CV2 (a sketch of the idea follows the tables below).
GeneMap versions (from → to): ToVersion rec. count | Difference rec. count | Reduction
  16-03-08 → 16-06-07: 15910 | 1458 | 91%
  16-03-08 → 16-04-28: 15871 | 1386 | 91%
  16-04-28 → 16-06-01: 15897 | 78 | 99.5%
  16-06-01 → 16-06-02: 15897 | 2 | 99.99%
  16-06-02 → 16-06-07: 15910 | 33 | 99.8%

ClinVar versions (from → to): ToVersion rec. count | Difference rec. count | Reduction
  15-02 → 16-05: 290815 | 38216 | 87%
  15-02 → 16-02: 285042 | 35550 | 88%
  16-02 → 16-05: 290815 | 3322 | 98.9%
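A minimal sketch of the diff-query idea above (not the SVI implementation), assuming reference data keyed by record id: compute the record-level diff between two versions and run the lookup only over the changed records, so the cost scales with the diff rather than with the whole database. Names and data are hypothetical.

```python
def diff(old: dict, new: dict) -> dict:
    """Record-level diff of two reference-data versions keyed by record id:
    records that are new, or whose content changed, between versions."""
    return {k: v for k, v in new.items() if k not in old or old[k] != v}

def query(reference: dict, variants: set) -> dict:
    """Stand-in for an SVI-style lookup: annotate the patient's variants
    with whatever the reference data says about them."""
    return {v: reference[v] for v in variants if v in reference}

# Hypothetical versions of a ClinVar-like table and a patient's variant set
cv1 = {"var1": "benign", "var2": "uncertain"}
cv2 = {"var1": "benign", "var2": "pathogenic", "var3": "benign"}
patient = {"var2", "var3"}

full = query(cv2, patient)                 # baseline: query the whole new version
partial = query(diff(cv1, cv2), patient)   # ReComp idea: query only the diff
print(full, partial)                       # the diff query touches far fewer records
```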
29
Saving resources on stream processing
[Diagram: baseline stream processing: the raw stream x1, x2, … is split into windows W1, W2, …, each processed by P to produce y1, y2, …. Conditional stream processing: for each new window Wi a comp/noComp decision either runs P to produce yi, or re-delivers an earlier output yi-h (h < i) as y'i]
- If we could predict that yi+1 will be similar to yi, we could skip computing P(Wi+1), save resources and instead deliver yi again
- Can we make optimal comp/noComp decisions? What is required?
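The sketch below (illustrative only; the process P, the drift predictor and the threshold are all made up) shows the shape of such a comp/noComp controller: each window is either processed or answered with the most recent output, depending on a predicted drift value.

```python
from typing import Callable, Iterable, List

def conditional_stream(windows: Iterable[list],
                       P: Callable[[list], float],
                       predict_drift: Callable[[List[float]], float],
                       threshold: float) -> List[float]:
    """Process a stream window by window; skip P when predicted drift is low."""
    outputs: List[float] = []
    for w in windows:
        if not outputs or predict_drift(outputs) > threshold:
            outputs.append(P(w))          # comp: pay for a fresh result
        else:
            outputs.append(outputs[-1])   # noComp: re-deliver the last result
    return outputs

# Toy usage: P is a mean; the drift predictor is just the last observed change
windows = [[1, 2, 3], [2, 3, 4], [2, 3, 4], [9, 9, 9]]
P = lambda w: sum(w) / len(w)
predict_drift = lambda ys: abs(ys[-1] - ys[-2]) if len(ys) > 1 else 1.0
print(conditional_stream(windows, P, predict_drift, threshold=0.1))
```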
30
Diff and currency functions
The quality of yi is initially maximal, and decreases over time in a way that depends on how rapidly the new values yj diverge from yi.
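The formal definitions on this slide are not in the transcript; one plausible way to write such a currency function down (an assumption, not the deck's exact definition) is:

```latex
% currency of a past outcome y_i at a later window j >= i:
% maximal (1) at j = i, decreasing as later outputs diverge from y_i
\mathrm{currency}(y_i, j) \;=\; 1 - \mathrm{diff}(y_i, y_j),
\qquad \mathrm{diff}(y_i, y_i) = 0, \quad 0 \le \mathrm{diff} \le 1 .
```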
31
Measuring DeComp performance
Evaluating the performance of comp/noComp decisions on each window.
Cost, boundary cases:
- A very conservative DeComp computes every value
- A very optimistic one computes only the first value
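The cost expressions on this slide are images in the original; assuming each execution of P has a constant cost c over n windows, the boundary cases can be written as:

```latex
% boundary cases (assumed formulation: constant per-window cost c, n windows)
\mathrm{Cost}_{\text{conservative}} = n \cdot c
  \quad \text{(compute every window; no staleness)}
\qquad
\mathrm{Cost}_{\text{optimistic}} = c
  \quad \text{(compute only the first window; later outputs may be stale)}
```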
32
Diff time series
33
Forecasting drift
[Diagram: the outputs yi form a derived diff time series; drift forecasting over that series drives the comp/noComp decision for the next window Wi+1, either running P or re-delivering yi-h (h < i) as y'i]
34
Initial experiments: the DEBS’15 Taxi routes challenge
Find the most frequent / most profitable taxi routes in Manhattan
within each 30’ window
VehicId,LicId, Pickup date, Drop off date, Dur,Dist,PickupLon,PickupLat,DropoffLon,DropofLat,Pay,Fare$, ...
0729...,E775...,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH, 3.50, ...
22D7...,3FF2...,2013-01-01 00:02:00,2013-01-01 00:02:00, 0,0.00, 0.000000, 0.000000, 0.000000, 0.000000,CSH,27.00, ...
0EC2...,778C...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.71,-73.973145,40.752827,-73.965897,40.760445,CSH, 4.00, ...
1390...,BE31...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.48,-74.004173,40.720947,-74.003838,40.726189,CSH, 4.00, ...
3B41...,7077...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.61,-73.987373,40.724861,-73.983772,40.730995,CRD, 4.00, ...
5FAA...,00B7...,2013-01-01 00:02:00,2013-01-01 00:03:00, 60,0.00, 0.000000, 0.000000, 0.000000, 0.000000,CRD, 2.50, ...
DFBF...,CF86...,2013-01-01 00:02:00,2013-01-01 00:03:00, 60,0.39,-73.981544,40.781475,-73.979439,40.784386,CRD, 3.00, ...
1E5F...,E0B2...,2013-01-01 00:03:00,2013-01-01 00:04:00, 60,0.00,-73.993973,40.751266, 0.000000, 0.000000,CSH, 2.50, ...
4682...,BB89...,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.71,-73.955383,40.779728,-73.967758,40.760326,CSH, 6.50, ...
5F78...,B756...,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.21,-73.973000,40.793140,-73.981453,40.778465,CRD, 6.00, ...
6BA2...,ED36...,2013-01-01 00:01:00,2013-01-01 00:04:00,180,0.74,-73.971138,40.758980,-73.972206,40.752502,CRD, 4.50, ...
75C9...,00B7...,2013-01-01 00:03:00,2013-01-01 00:04:00, 60,0.00, 0.000000, 0.000000, 0.000000, 0.000000,CRD, 3.00, ...
C306...,E255...,2013-01-01 00:01:00,2013-01-01 00:04:00,180,0.84,-73.942841,40.797031,-73.934540,40.797314,CSH, 4.50, ...
C4D6...,95B5...,2013-01-01 00:03:00,2013-01-01 00:04:00, 60,0.00,-73.989189,40.721924, 0.000000, 0.000000,CSH, 2.50, ...
35
Diff time series – taxi routes
[Diagram: the raw taxi trip stream (trip sti with end time fti, pickup cell → dropoff cell) is mapped to a routes time series (fti, Ri); windowing (W1, W2, W3, …) over the routes time series yields a top-k time series of route frequencies (Ri → Freqi) per window]
36
Routes drift – comparing ranked lists
[1] Fagin, Ronald, Ravi Kumar, and D. Sivakumar. “Comparing Top K Lists.” SIAM Journal on
Discrete Mathematics 17, no. 1 (January 2003): 134–60. doi:10.1137/S0895480102412856.
P outputs a list of the top-k most frequent / most profitable routes.
To compare lists we use the generalised Kendall's tau (Fagin et al. [1]); a sketch follows below.
It quantifies how much the top-k changes between one window and the next.
Input parameters determine stability / sensitivity:
- K: how many routes
- window size (e.g. 30')
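As an illustration of the measure (a sketch in the spirit of Fagin et al.'s K^(p) distance [1], not the project's code), the function below compares two top-k lists and returns a penalty normalised to [0, 1]; p is the penalty assigned to pairs seen in only one of the lists.

```python
from itertools import combinations

def kendall_topk(l1: list, l2: list, p: float = 0.5) -> float:
    """Generalised Kendall's tau distance between two top-k lists
    (after Fagin, Kumar, Sivakumar 2003), normalised by the number of pairs."""
    r1 = {item: r for r, item in enumerate(l1)}
    r2 = {item: r for r, item in enumerate(l2)}
    penalty, pairs = 0.0, 0
    for i, j in combinations(set(r1) | set(r2), 2):
        pairs += 1
        in1, in2 = (i in r1) + (j in r1), (i in r2) + (j in r2)
        if in1 == 2 and in2 == 2:
            # both items ranked in both lists: penalise discordant order
            penalty += (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
        elif in1 == 2 and in2 == 1:
            # both in l1, one in l2: the absent item is implicitly ranked lower in l2
            present = i if i in r2 else j
            absent = j if present == i else i
            penalty += r1[absent] < r1[present]
        elif in2 == 2 and in1 == 1:
            present = i if i in r1 else j
            absent = j if present == i else i
            penalty += r2[absent] < r2[present]
        elif in1 == 2 or in2 == 2:
            # both items appear in one list only: partial information, penalty p
            penalty += p
        else:
            # each item appears in a different list: certain disagreement
            penalty += 1
    return penalty / pairs if pairs else 0.0

# e.g. drift between two consecutive windows' top-3 route lists
print(kendall_topk(["R1", "R2", "R3"], ["R2", "R1", "R4"]))
```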
37
Drift function plots (top-10, date range [1/Jan 00:00–15/Jan 00:00)):
- window size: 2h
- window size: 1h
- window size: 30m
[Plots: normalised drift per window, y-axis 0–1.2]
38
Drift function plots (window size: 1h, date range [1/Jan 00:00–15/Jan 00:00)):
- top-40
- top-20
- top-10
[Plots: normalised drift per window, y-axis 0–1.2]
39
Approach: ARIMA forecasting
[Plot: actual normalised drift vs ARIMA(1,0,2)[1,0,1] forecast, with new-day markers. Drift function: top-10, window size = 1h, date range = [20/Jan 00:00–25/Jan 17:00)]
Drift prediction using time series forecasting
• This is the derived diff() time series!
• Autoregressive integrated moving average (ARIMA)
• Widely used and well understood, well supported
• Fast to compute
• Assumes normality of underlying random variable
Poor prediction: compute P too often or too rarely
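A minimal, illustrative sketch of the forecasting step, assuming the statsmodels library and a synthetic drift series; the non-seasonal (1,0,2) order is taken from the plot legend above, and the decision threshold is invented.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for the derived diff()/drift time series (one value per window)
rng = np.random.default_rng(42)
drift = 0.5 + 0.2 * np.sin(np.arange(200) / 10) + rng.normal(0, 0.05, 200)

model = ARIMA(drift, order=(1, 0, 2)).fit()   # ARIMA(p=1, d=0, q=2)
next_drift = model.forecast(steps=1)[0]       # forecast drift of the next window

THRESHOLD = 0.6   # hypothetical decision threshold
decision = "comp" if next_drift > THRESHOLD else "noComp"
print(f"forecast drift {next_drift:.2f} -> {decision}")
```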
40
The next steps -- challenges
• Can we learn effective surrogate models and estimators of change
impact?
• diff() functions, estimators seem very problem-specific
• To what extent can the ReComp framework be made generic,
reusable, yet still useful?
• Metadata infrastructure: a DB of past execution history
• Reproducibility: What really happens when I press the “ReComp”
button?
41
Summary and challenges
Forwards: react to changes in data used by processes. Backwards: restore the value of knowledge outcomes.
[Diagram: the ReComp meta-process loop:
- Monitor data changes (input and reference data versioning; data change events; new ground truth)
- Quantify data changes, then estimate the impact of changes and the cost of refresh
- Quantify knowledge decay, then estimate the benefit and the cost of refresh
- Optimise / prioritise outcomes (drawing on knowledge outcomes, provenance and cost)
- Re-compute selected outcomes]
ReComp:
a meta-process to observe and control underlying analytics processes
42
ReComp scenarios
Scenarios (target impact areas, why ReComp is relevant, proof-of-concept experiments, expected optimisation):

1. Dataflow, experimental science
   - Target impact areas: genomics
   - Why is ReComp relevant? Rapid knowledge advances; rapid scaling up of genetic testing at population level
   - Proof-of-concept experiments: WES/SVI pipeline, workflow implementation (eScience Central)
   - Expected optimisation: timeliness and accuracy of patient diagnosis, subject to budget constraints

2. Time series analysis
   - Target impact areas: personal health monitoring; smart city analytics; IoT data streams
   - Why is ReComp relevant? Rapid data drift; cost of computation at the network edge (e.g. IoT)
   - Proof-of-concept experiments: NYC taxi rides challenge (DEBS'15)
   - Expected optimisation: use of low-power edge devices when the outcome is predictable and data drift is low

3. Data layer optimisation
   - Target impact areas: tuning of a large-scale data management stack
   - Why is ReComp relevant? Optimal data organisation is sensitive to current data profiles
   - Proof-of-concept experiments: graph DB re-partitioning
   - Expected optimisation: system throughput vs cost of re-tuning

4. Model learning
   - Target impact areas: applications of predictive analytics
   - Why is ReComp relevant? Predictive models are very sensitive to data drift
   - Proof-of-concept experiments: Twitter content analysis
   - Expected optimisation: sustained model predictive power over time vs retraining cost

5. Simulation
   - Target impact areas: TBD
   - Why is ReComp relevant? Repeated simulation is computationally expensive but often not beneficial
   - Proof-of-concept experiments: flood modelling / CityCat Newcastle
   - Expected optimisation: computational resources vs marginal benefit of a new simulation model
43
Observability / transparency
White box vs black box:

Structure (static view):
- White box: dataflow (eScience Central, Taverna, VisTrails, …); scripting (R, Matlab, Python, …)
- Black box: function semantics; packaged components; third-party services

Data dependencies (runtime view):
- White box: provenance recording of inputs, reference datasets, component versions, outputs
- Black box: inputs and outputs only; no data dependencies; no details on individual components

Cost:
- White box: detailed resource monitoring; cloud → £££
- Black box: wall-clock time; service pricing; setup time (e.g. model learning)
44
Project structure
• 3 years funding from the EPSRC (£585,000 grant) on the Making Sense from Data call
• Feb. 2016 - Jan. 2019
• 2 RAs fully employed in Newcastle
• PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
• Co-Investigators (8% each):
• Prof. Watson, School of Computing Science, Newcastle University
• Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
• Dr. Phil James, Civil Engineering, Newcastle University
Builds upon the experience of the Cloud-e-Genome project: 2013-2015
Aims:
- To demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud
- To facilitate the adoption of reliable genetic testing in clinical practice
- A collaboration between the Institute of Genetic Medicine and the School of Computing
Science at Newcastle University
- Funding: NIHR / Newcastle BRC (£180,000) plus $40,000 Microsoft Research grant “Azure
for Research”
  • 4. ReComp–KeeleUniversity Dec.2016–P.Missier 4 Example: supervised learning Meta-knowledge Training set Model learning Classification algorithms Predictive classifier Background Knowledge (prior)  the training set is no longer representative of current data  the model loses predictive power Ex.: training set is a sample from social media stream (Twitter, Instagram, …) • Incremental training: established (neural networks, Bayes classifiers, …) • Incremental unlearning: some established work [1] t [1] Kidera, Takuya, Seiichi Ozawa, Shigeo Abe. “An Incremental Learning Algorithm of Ensemble Classifier Systems.” Neural Networks, 2006, 6453–59. doi:10.1109/IJCNN.2006.247345. [2] Polikar, R., L. Upda, S.S. Upda, and V. Honavar. “Learn++: An Incremental Learning Algorithm for Supervised Neural Networks.” IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31, no. 4 (2001): 497– 508. doi:10.1109/5326.983933. [3] Diehl, C.P., and G. Cauwenberghs. “SVM Incremental Learning, Adaptation and Optimization.” Proceedings of the International Joint Conference on Neural Networks, 2003. 4, no. x (2003): 2685–90. doi:10.1109/IJCNN.2003.1223991.
  • 5. ReComp–KeeleUniversity Dec.2016–P.Missier 5 Example: stream Analytics Meta-knowledge Data stream Time Series analysis Pattern recognition algorithms - Temporal Patterns - Activity detection - User behaviour - … Background Knowledge • If the output is stable over time, can I save computation and deliver older outcomes instead? • How do I quantify the quality/ cost trade-offs?
  • 6. ReComp–KeeleUniversity Dec.2016–P.Missier 6 Analytics functions and their dependencies can be complex Y = f(X, D) X inputs (vector of arbitrary data structures, “big data”) D: vector of dependencies: libraries, reference data Y outputs (vector of arbitrary data structures, “knowledge”) Ex.: machine learning Using Python and scikit-learn Learn model to recognise activity pattern Python 3 Ubuntu x.y.z Azure VM Model training Model Scikit-learn Numpy Pandas Ubuntu on Azure Dependencies Training + Testing dataset config Ex.: workflow to Identify mutations in a patient’s genome Workflow specification WF manager Linux VM cluster on Azure Analyse Input genome variants GATK/Picard/BWA Workflow Manager (and its own dependencies) Ubuntu on Azure Dep. Input genome config Ref genome Variants DBs
  • 7. ReComp–KeeleUniversity Dec.2016–P.Missier 7 Complex NGS pipelines Recalibration Corrects for system bias on quality scores assigned by sequencer GATK Computes coverage of each read. VCF Subsetting by filtering, eg non-exomic variants Annovar functional annotations (eg MAF, synonimity, SNPs…) followed by in house annotations Aligns sample sequence to HG19 reference genome using BWA aligner Cleaning, duplicate elimination Picard tools Variant calling operates on multiple samples simultaneously Splits samples into chunks. Haplotype caller detects both SNV as well as longer indels Variant recalibration attempts to reduce false positive rate from caller raw sequences align clean recalibrate alignments calculate coverage call variants recalibrate variants filter variants annotate coverage information annotated variants raw sequences align clean recalibrate alignments calculate coverage coverage informationraw sequences align clean calculate coverage coverage information recalibrate alignments annotate annotated variants annotate annotated variants Stage 1 Stage 2 Stage 3 filter variants filter variants
  • 8. Problem size: HPC vs Cloud deployment
    Configuration:
    - HPC cluster (dedicated nodes): 3x 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
    - Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD
    [Chart: response time [hh:mm] vs. number of samples (0-24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)]
    Big Data:
    - raw sequences for Whole Exome Sequencing (WES): 5-20 GB per patient
    - processed in cohorts of 20-40, i.e. close to 1 TB per cohort
    - the time required to process a 24-sample cohort can easily exceed 2 CPU-months
    - WES is about 2% of what Whole Genome Sequencing analyses require
  • 9. Understanding change: threats and opportunities
    [Diagram: Big Data feeds Life Sciences Analytics producing "Valuable Knowledge" (versions V1, V2, V3 over time), alongside evolving meta-knowledge: algorithms, tools, middleware, reference datasets]
    Changes:
    • Algorithms and tools
    • Accuracy of input sequences
    • Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard, ...)
    Questions:
    • Threats: will any of the changes invalidate prior findings?
    • Opportunities: can the findings from the pipelines be improved over time?
    • Cost: need to model future costs based on past history and pricing trends for virtual appliances
    • Impact analysis:
      • Which patients/samples are likely to be affected?
      • How do we estimate the potential benefits for affected patients?
      • Can we estimate the impact of these changes without re-computing entire cohorts?
  • 10. ReComp
    A decision support system for selectively re-computing complex analytics in reaction to change:
    - Generic: not just for the life sciences
    - Customisable: e.g. for genomics pipelines
    [Diagram: the ReComp cycle: observe change (in big data, in meta-knowledge), assess and measure (knowledge decay), estimate (cost and benefits of refresh), enact (reproduce the analytics processes)]
  • 11. Challenges
    1. Observability: to what extent can we observe the process and its execution?
       • Process structure
       • Data flow → provenance
    2. Detecting and quantifying changes:
       • in inputs, dependencies, outputs → diff() functions
    3. Control: how much control do we have on the system?
       • Re-run: how often?
       • Total vs partial execution
       • Input density / resolution / incremental update
       • E.g. non-monotonic learning / unlearning
    [Diagram: the ReComp Decision Support System consumes change events, diff(.,.) functions, "business rules", impact and cost estimates, reproducibility assessments and a history of past knowledge assets, and produces an optimal re-computation prioritisation]
  • 14. Example: NGS variant interpretation
    Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
    - E.g. 100K Genome Project, Genomics England, GeCIP
    [Pipeline diagram (three stages, one lane per sample): align → clean → recalibrate alignments → calculate coverage → call variants → recalibrate variants → filter variants → annotate, producing coverage information and annotated variants]
    Variant interpretation can help to confirm or reject a hypothesis about the patient's phenotype. It classifies variants into three categories: RED (pathogenic), AMBER (unknown/uncertain), GREEN (benign).
    Also: metagenomics, i.e. species identification, e.g. the EBI metagenomics portal
  • 16. Change in variant interpretation
    What changes:
    - Improved sequencing / variant calling
    - ClinVar and OMIM evolve rapidly
    - New reference data sources
  • 17. ReComp problem statement
    1. Estimate the impact of changes
    2. Optimise ReComp decisions: select the subset of the population that maximises expected impact, subject to a budget constraint
    Problem: P is computationally expensive. A sketch of the selection step follows below.
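    A minimal sketch of that selection step, assuming per-outcome impact and cost estimates are already available (the list layout and the numbers are purely illustrative, not the project's implementation): a greedy heuristic that prioritises outcomes by estimated impact per unit cost until the budget C is exhausted.

        # Greedy heuristic for the ReComp selection problem:
        # choose a subset of outcomes that maximises total estimated impact
        # subject to a re-computation budget C. Illustrative sketch only.

        def select_for_recomp(outcomes, budget):
            """outcomes: list of (outcome_id, estimated_impact, estimated_cost)."""
            # Rank by estimated impact per unit cost, highest first
            ranked = sorted(outcomes, key=lambda o: o[1] / o[2], reverse=True)
            selected, spent = [], 0.0
            for outcome_id, impact, cost in ranked:
                if spent + cost <= budget:
                    selected.append(outcome_id)
                    spent += cost
            return selected, spent

        # Example: three patients' outcomes with estimated impact of a ClinVar update
        patients = [("p1", 0.9, 17.0), ("p2", 0.1, 17.0), ("p3", 0.6, 12.0)]
        print(select_for_recomp(patients, budget=30.0))   # (['p1', 'p3'], 29.0)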
  • 18. Estimators: formalisation and a possible approach
    Given (for simplicity) y = f(x) and a change x → x' (and local changes), assess diff_Y(f(x), f(x')).
    Problem: f() is computationally expensive.
    Approach (sensitivity analysis): learn an approximation f'() of f(), a surrogate (emulator), such that f'(x) = f(x) + ε, where ε is a stochastic term that accounts for the error in approximating f and is typically assumed to be Gaussian.
    Learning f'() requires a training set { (x_i, y_i) }. If f'() can be found, then we can hope to use it to approximate diff_Y(f(x), f(x')) with diff_Y(f'(x), f'(x')), which can then be used to carry out sensitivity analysis. A sketch of this idea follows below.
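    One way to realise the emulator idea with the Python/scikit-learn stack mentioned earlier in the deck. This is only a sketch under the assumption that past executions { (x_i, y_i) } can be flattened into numeric feature vectors; the synthetic data stands in for a real execution history, and a Gaussian process regressor is just one possible choice of surrogate f'.

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor

        # Training set collected from past executions of the expensive function f:
        # X_hist are (flattened, numeric) inputs, y_hist the corresponding outputs.
        rng = np.random.default_rng(0)
        X_hist = rng.uniform(0, 1, size=(50, 3))
        y_hist = X_hist.sum(axis=1) + rng.normal(0, 0.05, size=50)  # stand-in for f(x) + noise

        # f'(): the surrogate (emulator) of f()
        surrogate = GaussianProcessRegressor().fit(X_hist, y_hist)

        def estimated_output_change(x_old, x_new):
            """Approximate diff_Y(f(x), f(x')) by |f'(x') - f'(x)|, avoiding two runs of f."""
            y_old, y_new = surrogate.predict(np.vstack([x_old, x_new]))
            return abs(y_new - y_old)

        x, x_changed = X_hist[0], X_hist[0] + np.array([0.0, 0.2, 0.0])
        print(estimated_output_change(x, x_changed))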
  • 19. Scope of change
    1. Change: may affect a subset of the patient population → scope. Which patients will be affected?
    2. Change: affects a single patient → partial re-run
  • 20. Challenge 1: battleships
    [Patient / change impact matrix, with X marking the high-impact cases]
    First challenge: precisely identify the scope of a change.
    - Blind reaction to change: recompute the entire matrix
    - Can we do better? Hit the high-impact cases (the X) without re-computing the entire matrix
  • 21. SVI process: detailed design
    [Dataflow: patient variants and a phenotype hypothesis feed three steps, phenotype-to-genes (using GeneMap), variant selection, and variant classification (using ClinVar), producing classified variants]
  • 22. Baseline: blind recomputation
    - 17 minutes / patient (single-core VM)
    - Runtime is consistent across different phenotypes
    - Changes to GeneMap/ClinVar have negligible impact on the execution time
    GeneMap version:           2016-03-08    2016-04-28    2016-06-07
    Run time μ ± σ [mm:ss]:    17:05 ± 22    17:09 ± 15    17:10 ± 17
  • 23. Inside a single instance: partial re-computation
    [Diagram: the fragments of the SVI process affected by a change in ClinVar vs. by a change in GeneMap]
  • 24. White-box, granular provenance
    - Using provenance metadata to identify fragments of SVI that are affected by the change in reference data (a sketch follows below)
    [Diagram: inputs x11, x12 and dependencies D11, D12 used by process P to derive output y11]
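    A sketch of the scoping idea: given the set of reference records each prior outcome actually used (recorded as PROV 'used' relations with role 'dep', per the editor's notes) and the diff of a reference dataset, return the outcomes whose provenance mentions a changed record. The record identifiers and dictionary layout here are illustrative.

        # Use recorded provenance to scope a reference-data change.
        # prov maps each prior outcome to the set of reference records it 'used'.

        def affected_outcomes(prov, changed_record_ids):
            """Return outcomes whose provenance intersects the set of changed records."""
            changed = set(changed_record_ids)
            return [outcome for outcome, used_records in prov.items()
                    if used_records & changed]

        prov = {
            "patient_1": {"ClinVar:rs123", "GeneMap:CMS"},
            "patient_2": {"ClinVar:rs999"},
        }
        # diff(ClinVar_t, ClinVar_t') produced one updated record:
        print(affected_outcomes(prov, ["ClinVar:rs123"]))   # ['patient_1']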
  • 25. Results
    GeneMap version:           2016-04-28    2016-06-07
    Run time μ ± σ [mm:ss]:    11:51 ± 16    11:50 ± 20
    Savings:                   31%           31%

    ClinVar version:           2016-02       2016-05
    Run time μ ± σ [mm:ss]:    9:51 ± 14     9:50 ± 15
    Savings:                   43%           42%

    How much can we save? It depends on:
    • the process structure
    • where the reference data is first used
    Overhead: storing the interim data required for partial re-execution
    • 20-22 MB for GeneMap changes and 2-334 kB for ClinVar changes
  • 26. Partial re-computation using input difference
    Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2))
    - Bigger gain: diff(CV1, CV2) is much smaller than CV2
    - Works for SVI, but hard to generalise: it depends on the type of process
    A sketch of how such version diffs can be computed follows below.

    GeneMap versions (from → to)    To-version rec. count    Difference rec. count    Reduction
    16-03-08 → 16-06-07             15910                    1458                     91%
    16-03-08 → 16-04-28             15871                    1386                     91%
    16-04-28 → 16-06-01             15897                    78                       99.5%
    16-06-01 → 16-06-02             15897                    2                        99.99%
    16-06-02 → 16-06-07             15910                    33                       99.8%

    ClinVar versions (from → to)    To-version rec. count    Difference rec. count    Reduction
    15-02 → 16-05                   290815                   38216                    87%
    15-02 → 16-02                   285042                   35550                    88%
    16-02 → 16-05                   290815                   3322                     98.9%
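    A sketch of building diff(CV1, CV2) with pandas, assuming each reference-data version can be read as a table keyed by variant with a status column (the column names and toy records are illustrative, not ClinVar's actual schema). The downstream query is then run on the much smaller diff table.

        import pandas as pd

        def reference_diff(old, new, key="variant", value="status"):
            """Records added, removed, or whose value changed between two versions."""
            merged = old.merge(new, on=key, how="outer",
                               suffixes=("_old", "_new"), indicator=True)
            added   = merged["_merge"] == "right_only"
            removed = merged["_merge"] == "left_only"
            changed = (merged["_merge"] == "both") & \
                      (merged[f"{value}_old"] != merged[f"{value}_new"])
            return merged[added | removed | changed]

        cv_old = pd.DataFrame({"variant": ["rs1", "rs2"], "status": ["benign", "unknown"]})
        cv_new = pd.DataFrame({"variant": ["rs1", "rs2", "rs3"],
                               "status": ["benign", "pathogenic", "unknown"]})
        diff = reference_diff(cv_old, cv_new)
        print(diff[["variant", "status_old", "status_new"]])   # rs2 changed, rs3 added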
  • 27. Saving resources on stream processing
    [Diagram: a raw stream split into windows W1, W2, ...; baseline processing computes y_i = P(W_i) for every window, while conditional processing makes a comp/noComp decision per window and may re-deliver an earlier output y_{i-h} instead]
    - If we could predict that y_{i+1} will be similar to y_i, we could skip computing P(W_{i+1}), save resources and deliver y_i again instead
    - Can we make optimal comp/noComp decisions? What is required? A minimal comp/noComp loop is sketched below.
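    A minimal sketch of the conditional processing loop, assuming a drift predictor and a threshold are available (both are stand-ins here; the slide that follows discusses how the prediction could actually be made).

        # Conditional stream processing: recompute P(W_i) only when the predicted
        # drift since the last computed output exceeds a threshold; otherwise
        # re-deliver the last computed output.

        def conditional_stream(windows, P, predict_drift, threshold=0.3):
            last_output, delivered = None, []
            for i, window in enumerate(windows):
                if last_output is None or predict_drift(i) > threshold:
                    last_output = P(window)       # comp
                delivered.append(last_output)     # deliver fresh or reused output
            return delivered

        # Toy example: P averages each window; drift is predicted high every 3rd window.
        windows = [[1, 2], [2, 3], [10, 12], [11, 12]]
        outputs = conditional_stream(windows, P=lambda w: sum(w) / len(w),
                                     predict_drift=lambda i: 1.0 if i % 3 == 0 else 0.0)
        print(outputs)   # [1.5, 1.5, 1.5, 11.5]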
  • 28. Diff and currency functions
    The quality of y_i is initially maximal and decreases over time, in a way that depends on how rapidly the new values y_j diverge from y_i (see the small example below).
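    A direct transcription of the currency function given in the editor's notes (q(y_i, j) = q_max when i = j, otherwise q_max - |diff_O(y_i, y_j)|). The numeric diff_O and the sample outputs are purely illustrative.

        # Currency of an old output y_i at a later time j (from the editor's notes).
        Q_MAX = 1.0

        def diff_o(y_a, y_b):
            return min(abs(y_a - y_b), Q_MAX)   # toy diff_O, maps into [0, 1]

        def currency(y_i, y_j, same_time):
            return Q_MAX if same_time else Q_MAX - diff_o(y_i, y_j)

        y = [0.50, 0.52, 0.58, 0.80]            # successive outputs y_1 ... y_N
        print([currency(y[0], yj, j == 0) for j, yj in enumerate(y)])
        # approximately [1.0, 0.98, 0.92, 0.7]: the quality of y_1 decays as later outputs diverge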
  • 29. Measuring DeComp performance
    Evaluating the performance of the comp/noComp decisions made on each window, relative to their cost.
    Boundary cases:
    - Very conservative: DeComp computes every value
    - Very optimistic: only the first value is computed
    (A small numeric example follows below; the full formulas are in the editor's notes.)
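    A worked numeric example of the perf(N) measure defined in the editor's notes (sum of delivered currency divided by N times total cost), evaluated at the two boundary strategies. The output values and the unit cost c are illustrative.

        # Evaluate perf(N) for the two boundary comp/noComp strategies.
        Q_MAX, c = 1.0, 1.0

        def perf(delivered_quality, total_cost):
            return sum(delivered_quality) / (len(delivered_quality) * total_cost)

        true_outputs = [0.50, 0.52, 0.58, 0.80]          # y_1 ... y_N
        diff_o = lambda a, b: min(abs(a - b), Q_MAX)
        N = len(true_outputs)

        # Very conservative: compute every value -> quality always Q_MAX, total cost N*c
        print(perf([Q_MAX] * N, N * c))                  # = Q_MAX / (N*c) = 0.25

        # Very optimistic: compute only y_1 and keep delivering it -> total cost c
        stale_quality = [Q_MAX - diff_o(true_outputs[0], y) for y in true_outputs]
        print(perf(stale_quality, c))                    # = (1/c) * [Q_MAX - average drift from y_1]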
  • 31. Forecasting drift
    [Diagram: the comp/noComp decision for window W_{i+1} is driven by time-series forecasting on a derived series of drift values computed from past outputs]
  • 32. Initial experiments: the DEBS’15 taxi routes challenge
    Find the most frequent / most profitable taxi routes in Manhattan within each 30-minute window.
    Input record format:
      VehicId, LicId, Pickup date, Drop off date, Dur, Dist, PickupLon, PickupLat, DropoffLon, DropoffLat, Pay, Fare$, ...
    Sample records:
      0729...,E775...,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH, 3.50, ...
      22D7...,3FF2...,2013-01-01 00:02:00,2013-01-01 00:02:00, 0,0.00, 0.000000, 0.000000, 0.000000, 0.000000,CSH,27.00, ...
      (further sample rows omitted)
  • 33. Diff time series – taxi routes
    [Diagram: the raw data stream (trip records) is mapped to a routes time series (one route R per trip, grouped into windows W1, W2, W3, ...), which in turn is mapped to a top-k time series (a per-window ranked list of routes R1...Rk with their frequencies); a sketch of this per-window derivation follows below]
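    A sketch of deriving a top-k route list per window. The tuple layout of the trips, the grid-cell size and the parameter values are simplified assumptions for illustration, not the challenge's exact definitions.

        from collections import Counter

        # Derive a top-k route list for each 30-minute window of taxi trips.
        # A 'route' is a (pickup cell, dropoff cell) pair on a coarse grid.

        def cell(lon, lat, size=0.005):
            return (round(lon / size), round(lat / size))

        def top_k_routes(trips, k=10, window_secs=1800):
            """trips: iterable of (timestamp_secs, pickup_lon, pickup_lat, drop_lon, drop_lat)."""
            windows = {}
            for t, plon, plat, dlon, dlat in trips:
                w = int(t // window_secs)                      # window index
                route = (cell(plon, plat), cell(dlon, dlat))
                windows.setdefault(w, Counter())[route] += 1
            return {w: [r for r, _ in counts.most_common(k)] for w, counts in windows.items()}

        trips = [(60, -73.95, 40.71, -73.96, 40.71), (120, -73.95, 40.71, -73.96, 40.71),
                 (200, -73.97, 40.75, -73.96, 40.76), (2000, -73.95, 40.71, -73.96, 40.71)]
        print(top_k_routes(trips, k=2))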
  • 34. Routes drift – comparing ranked lists
    P outputs a list of the top-k most frequent / most profitable routes.
    To compare consecutive lists we use the generalised Kendall’s tau (Fagin et al. [1]), which quantifies how much the top-k list changes between one window and the next.
    Input parameters determine stability / sensitivity:
    - k: how many routes
    - window size (e.g. 30 minutes)
    A simplified sketch of the comparison follows below.
    [1] Fagin, Ronald, Ravi Kumar, and D. Sivakumar. “Comparing Top K Lists.” SIAM Journal on Discrete Mathematics 17, no. 1 (January 2003): 134–60. doi:10.1137/S0895480102412856.
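    A simplified reading of the K^(p) idea from Fagin et al.: count disagreeing pairs, charge a penalty p for pairs that one list says nothing about, and scale by the number of pairs. The exact penalty cases and normalisation in the paper differ; this is an illustrative sketch, not the project's implementation.

        from itertools import combinations

        def topk_drift(list_a, list_b, p=0.5):
            """Simplified generalised Kendall's tau distance between two top-k lists,
            scaled by the number of item pairs so the value lies in [0, 1]."""
            rank_a = {x: i for i, x in enumerate(list_a)}
            rank_b = {x: i for i, x in enumerate(list_b)}

            def implied_order(rank, x, y):
                # -1 if x ranked ahead of y, +1 if y ahead of x, 0 if the list says nothing
                if x in rank and y in rank:
                    return -1 if rank[x] < rank[y] else 1
                if x in rank:                       # x listed, y not: x implicitly ahead
                    return -1
                if y in rank:
                    return 1
                return 0

            penalty, pairs = 0.0, 0
            for x, y in combinations(set(rank_a) | set(rank_b), 2):
                pairs += 1
                oa, ob = implied_order(rank_a, x, y), implied_order(rank_b, x, y)
                if oa and ob:
                    penalty += 1.0 if oa != ob else 0.0   # both lists order the pair
                else:
                    penalty += p                          # pair unseen in one list
            return penalty / pairs if pairs else 0.0

        print(topk_drift(["R1", "R2", "R3"], ["R1", "R2", "R3"]))   # 0.0 (no drift)
        print(topk_drift(["R1", "R2", "R3"], ["R2", "R1", "R4"]))   # about 0.33 (swap + churn)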
  • 35. [Plots: drift function over time for top-10 routes, date range [1/Jan 00:00–15/Jan 00:00), at window sizes 2h, 1h and 30m]
  • 36. [Plots: drift function over time, window size 1h, date range [1/Jan 00:00–15/Jan 00:00), for top-40, top-20 and top-10 routes]
  • 37. Approach: ARIMA forecasting
    [Plot: actual normalised drift vs ARIMA(1,0,2)[1,0,1] forecast; drift function for top-10 routes, window size 1h, date range [20/Jan 00:00–25/Jan 17:00), with new-day boundaries marked]
    Drift prediction using time series forecasting:
    • This is the derived diff() time series!
    • Autoregressive integrated moving average (ARIMA)
    • Widely used, well understood and well supported
    • Fast to compute
    • Assumes normality of the underlying random variable
    Poor prediction means computing P too often or too rarely. A sketch of this forecasting step follows below.
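    A sketch of the forecasting step using statsmodels (an assumption; the slide does not name the library actually used). The drift history, threshold and the non-seasonal (1,0,2) order are illustrative; the seasonal part quoted on the slide is omitted here, and order selection is a tuning choice.

        import numpy as np
        from statsmodels.tsa.arima.model import ARIMA

        # Forecast the next value of the derived drift time series; if the forecast
        # exceeds a threshold, schedule a re-computation of P for the next window.
        drift_history = [0.10, 0.12, 0.11, 0.35, 0.15, 0.13, 0.12, 0.40, 0.14, 0.12]

        model = ARIMA(np.asarray(drift_history), order=(1, 0, 2)).fit()
        next_drift = model.forecast(steps=1)[0]

        THRESHOLD = 0.3   # illustrative comp/noComp threshold
        print(f"forecast drift = {next_drift:.3f}",
              "-> recompute" if next_drift > THRESHOLD else "-> reuse previous output")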
  • 38. The next steps -- challenges
    • Can we learn effective surrogate models and estimators of change impact?
    • diff() functions and estimators seem very problem-specific
    • To what extent can the ReComp framework be made generic and reusable, yet still useful?
    • Metadata infrastructure: a database of past execution history
    • Reproducibility: what really happens when I press the “ReComp” button?
  • 39. Summary and challenges
    ReComp: a meta-process to observe and control underlying analytics processes.
    - Forwards: react to changes in the data used by processes. Monitor and quantify data changes, estimate the impact of changes and the cost of refresh, optimise / prioritise, and re-compute selected outcomes.
    - Backwards: restore the value of knowledge outcomes. Quantify knowledge decay, estimate the benefit and cost of refresh.
    Inputs to the meta-process: input and reference data versioning, data change events, new ground truth, provenance and cost of past executions, knowledge outcomes.
  • 40. ReComp scenarios
    1. Dataflow, experimental science
       - Target impact area: genomics
       - Why ReComp is relevant: rapid knowledge advances; rapid scaling up of genetic testing at population level
       - Proof-of-concept experiment: WES/SVI pipeline, workflow implementation (eScience Central)
       - Expected optimisation: timeliness and accuracy of patient diagnosis subject to budget constraints
    2. Time series analysis
       - Target impact areas: personal health monitoring, smart city analytics, IoT data streams
       - Why ReComp is relevant: rapid data drift; cost of computation at the network edge (e.g. IoT)
       - Proof-of-concept experiment: NYC taxi rides challenge (DEBS’15)
       - Expected optimisation: use of low-power edge devices when the outcome is predictable and data drift is low
    3. Data layer optimisation
       - Target impact area: tuning of a large-scale data management stack
       - Why ReComp is relevant: optimal data organisation is sensitive to current data profiles
       - Proof-of-concept experiment: graph DB re-partitioning
       - Expected optimisation: system throughput vs cost of re-tuning
    4. Model learning
       - Target impact area: applications of predictive analytics
       - Why ReComp is relevant: predictive models are very sensitive to data drift
       - Proof-of-concept experiment: Twitter content analysis
       - Expected optimisation: sustained model predictive power over time vs retraining cost
    5. Simulation
       - Target impact area: TBD
       - Why ReComp is relevant: repeated simulation is computationally expensive but often not beneficial
       - Proof-of-concept experiment: flood modelling / CityCat Newcastle
       - Expected optimisation: computational resources vs marginal benefit of a new simulation model
  • 41. Observability / transparency
    Structure (static view):
    - White box: dataflow systems (eScience Central, Taverna, VisTrails, ...); scripting (R, Matlab, Python, ...)
    - Black box: function semantics; packaged components; third-party services
    Data dependencies (runtime view):
    - White box: provenance recording of inputs, reference datasets, component versions, outputs
    - Black box: inputs and outputs only; no data dependencies; no details on individual components
    Cost:
    - White box: detailed resource monitoring; cloud → £££
    - Black box: wall clock time; service pricing; setup time (e.g. model learning)
  • 42. Project structure
    • 3 years of funding from the EPSRC (£585,000 grant) under the Making Sense from Data call
    • Feb. 2016 – Jan. 2019
    • 2 RAs fully employed in Newcastle
    • PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
    • Co-investigators (8% each):
      • Prof. Watson, School of Computing Science, Newcastle University
      • Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
      • Dr. Phil James, Civil Engineering, Newcastle University
    Builds upon the experience of the Cloud-e-Genome project (2013–2015):
    - A collaboration between the Institute of Genetic Medicine and the School of Computing Science at Newcastle University
    - Funding: NIHR / Newcastle BRC (£180,000) plus a $40,000 Microsoft Research “Azure for Research” grant
    - Aims: to demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud, and to facilitate the adoption of reliable genetic testing in clinical practice

Editor's notes

  1. The times they are a’changin
  2. Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
  3. Program $P$ takes input $x$ and depends on reference data resources $D = \{D_1 \ldots D_m\}$.
     Each execution $i: 1 \dots N$ operates on a version of its input $x_i^t$, on a state $d_j^t$ for each $D_j \in D$, and incurs cost $c_i^t$:
       $\langle y_i^t, c_i^t \rangle = \mathit{exec}(P, x_i^t, \{ d_1^t \dots d_m^t \})$
     Data version changes:
     - inputs: $x_i^t \rightarrow x_i^{t'}$
     - dependencies: $d_j^t \rightarrow d_j^{t'}$, a new release of $D_j$ at time $t'$
     Diff functions:
     - $\mathit{diff}_X(x_i^t, x_i^{t'})$
     - $\mathit{diff}_Y(y_i^t, y_i^{t'})$
     - $\mathit{diff}_{D_j}(d_j^t, d_j^{t'})$, e.g. added, removed, updated records
  4. At time $t$:
       $\langle y_i^{t}, c_i^{t} \rangle = \mathit{exec}(P, x_i^{t}, d^{t})$
     At time $t' > t$, after a change in dependency $d_j^t \rightarrow d_j^{t'}$:
       $\langle y_i^{t'}, c_i^{t'} \rangle = \mathit{exec}(P, x_i^{t}, d^{t'})$, where $d^{t'} = \{ d_1^t \dots d_j^{t'} \dots d_m^t \}$
     Impact of the change $d_j^t \rightarrow d_j^{t'}$:
       $\mathit{imp}(d_j^t \rightarrow d_j^{t'}, y_i^t) = f_Y(\mathit{diff}_Y(y_i^t, y_i^{t'})) \in [0,1]$
     where the function $f_Y()$ is type-specific (and domain-specific).
  5. $\langle y_i^{t}, c_i^{t} \rangle = \mathit{exec}(P, x_i^{t}, d^{t})$, with $P$ = SVI and one execution per patient, $i: 1 \dots N$.
     - $x_i = \langle \mathit{varset}_i, \mathit{ph}_i \rangle$, where $\mathit{varset}_i$ are the patient's variants and $\mathit{ph}_i = \{ dt_1, dt_2, \dots \}$ is the patient's phenotype expressed as disease terms, e.g. “congenital myasthenic syndrome”
     - SVI is a classifier: $y = \{ (v, \mathit{class}) \mid v \in \mathit{varset}, \mathit{class} \in \{\textsf{red}, \textsf{amber}, \textsf{green}\} \}$
     - Reference datasets: $D = \{\mathit{OM}, \mathit{CV}\}$, with $\mathit{OM}$ = OMIM GeneMap and $\mathit{CV}$ = ClinVar: $\mathit{OM} = \{ \langle dt, \mathit{genes}(dt) \rangle \}$ and $\mathit{CV} = \{ \langle v, g, \mathit{varst}(v) \rangle \}$, where $\mathit{varst}(v) \in \{\textsf{unknown}, \textsf{benign}, \textsf{pathogenic}\}$
  6. $\mathit{diff}_{OM}$ returns the disease-term-to-gene mappings that have changed between the two versions (including possibly new mappings):
       $\mathit{diff}_{OM}(\mathit{OM}^t, \mathit{OM}^{t'}) = \{ \langle dt, \mathit{genes}(dt) \rangle \mid \mathit{genes}(dt) \neq \mathit{genes}'(dt) \}$
     where $\mathit{genes}'(dt)$ is the new mapping for $dt$ in $\mathit{OM}^{t'}$.
       $\mathit{diff}_{CV}(\mathit{CV}^t, \mathit{CV}^{t'}) = \{ \langle v, \mathit{varst}(v) \rangle \mid \mathit{varst}(v) \neq \mathit{varst}'(v) \} \cup (\mathit{CV}^{t'} \setminus \mathit{CV}^t) \cup (\mathit{CV}^t \setminus \mathit{CV}^{t'})$
     where $\mathit{varst}'(v)$ is the new class associated with $v$ in $\mathit{CV}^{t'}$.
  7. Let $O^t = \{ y_1^t, \dots, y_N^t \}$ be the set of all outcomes that are current at time $t$, and consider (for simplicity) a single change $d_j^t \rightarrow d_j^{t'}$.
     Select the optimal subset $O_{rc}^t \subseteq O^t$ such that
       $\max_{O_{rc}^t \subseteq O^t} \sum_{y_i^t \in O_{rc}^t} \mathit{imp}(d_j^t \rightarrow d_j^{t'}, y_i^t)$, subject to $\sum_{y_i^t \in O_{rc}^t} c_i^{t'} \leq C$
     Since true impact and cost are unknown before re-computation, use estimates
       $\{ \langle \widehat{\mathit{imp}}(d_j^t \rightarrow d_j^{t'}, y_i^t), \hat{c}_i^{t'} \rangle \mid y_i^t \in O^t \}$
     and solve
       $\max_{O_{rc}^t \subseteq O^t} \sum_{y_i^t \in O_{rc}^t} \widehat{\mathit{imp}}(d_j^t \rightarrow d_j^{t'}, y_i^t)$, subject to $\sum_{y_i^t \in O_{rc}^t} \hat{c}_i^{t'} \leq C$
  8. Starting from $\langle y_i^t, c_i^t \rangle = \mathit{exec}(P, x_i^t, \{ d_1^t \dots d_m^t \})$, write (for simplicity) $y = f(x)$ for the process $P$.
     Given a change $x \rightarrow x'$, we want $\mathit{diff}_Y(f(x), f(x'))$.
     Learn $f'()$ such that $f'(x) = f(x) + \epsilon$; then approximate
       $\mathit{diff}_Y(f(x), f(x')) \approx \mathit{diff}_Y(f'(x), f'(x'))$
  9. $\langle y_i^{t}, c_i^{t} \rangle = \mathit{exec}(P, x_i^{t}, d^{t})$, with input changes $x_i^t \rightarrow x_i^{t'}$ and dependency changes $d_j^t \rightarrow d_j^{t'}$.
  10. Given the dependency diff $\mathit{diff}_D(D_i^v, D_i^{v'})$, an output $y^v$ is affected if some changed record $d_{ij} \in \mathit{diff}_D(D_i^v, D_i^{v'})$ appears in its provenance, i.e. $\texttt{used}(P_j, d_{ij}, [\texttt{prov:role} = \texttt{'dep'}]) \in \mathit{prov}(y^v)$.
  11. analyse(W) runs analytics on a windowed stream $W_1, W_2, \dots$; at time $t_i$ it produces and delivers output $O_i = \mathit{analyse}(W_i)$. (Requires the ReComp preamble.)
        $\hat{y}_i = \begin{cases} y_i & \text{if } y_i = P(W_i) \text{ is computed} \\ y_{i-k} & \text{otherwise, where } y_{i-k} \text{ is the latest computed value} \end{cases}$
      Denote this value as $\mathit{surr}(y_i)$.
  12. $\mathit{diff}_O: O \times O \rightarrow [0,1]$, with $\mathit{diff}_O(y_i, y_i) = 0$.
      Quality function $q: O \times \mathbb{N} \rightarrow [0, q_{max}]$: $q(y_i, j)$ quantifies the currency of $y_i$ at time $j > i$.
        $q(y_i, j) = \begin{cases} q_{max} & \text{when } i = j \\ q_{max} - |\mathit{diff}_O(y_i, y_j)| & \text{otherwise} \end{cases}$
        $\mathit{perf}(N) = \dfrac{\sum_{i=1}^{N} q(\mathit{surr}(y_i), i)}{N \cdot C}$
  13. Very conservative case (compute every value): $C = N \cdot c$ and $\sum_{i=1}^{N} q(\mathit{act}(y_i), i) = N \cdot q_{max}$, because $\mathit{act}(y_i) = \mathit{out}_i$ for all $i$; thus $\mathit{perf}(N) = \dfrac{q_{max}}{N \cdot c}$.
      Very optimistic case (compute only the first value): $q(\mathit{act}(y_i), i) = q(y_1, i) = q_{max} - \mathit{diff}_O(y_1, y_i)$ for each $i$; thus $\mathit{perf}(N) = \dfrac{1}{c}\Bigl[\, q_{max} - \dfrac{\sum_{i=1}^{N} \mathit{diff}_O(y_1, y_i)}{N} \Bigr]$.
  14. We used the generalised Kendall’s tau as a well recognised, generic method to compare top-k lists.