ReComp:
Preserving the value of large scale data analytics over time
through selective re-computation
recomp.org.uk
Paolo Missier, Jacek Cala, Manisha Rathi
School of Computing Science
Newcastle University
Keele University, Dec. 2016
(*) Painting by Johannes Moreelse
Panta Rhei (Heraclitus)
2
Data Science
[Diagram: Big Data feeds "The Big Analytics Machine", producing "Valuable Knowledge"; the machine rests on meta-knowledge: algorithms, tools, middleware, reference datasets]
3
Data Science over time
[Diagram: the same picture over time t: Big Data, the meta-knowledge (algorithms, tools, middleware, reference datasets) and the resulting "Valuable Knowledge" all evolve through versions V1, V2, V3]
4
Example: supervised learning
[Diagram: a training set and background knowledge (prior) feed model learning, using classification algorithms (meta-knowledge), to produce a predictive classifier]
If the training set is no longer representative of current data, the model loses predictive power.
Ex.: the training set is a sample from a social media stream (Twitter, Instagram, …)
• Incremental training: established (neural networks, Bayes classifiers, …); see the sketch after the references below
• Incremental unlearning: some established work [1]
[1] Kidera, Takuya, Seiichi Ozawa, Shigeo Abe. “An Incremental Learning Algorithm of Ensemble Classifier Systems.” Neural
Networks, 2006, 6453–59. doi:10.1109/IJCNN.2006.247345.
[2] Polikar, R., L. Upda, S.S. Upda, and V. Honavar. “Learn++: An Incremental Learning Algorithm for Supervised Neural
Networks.” IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31, no. 4 (2001): 497–
508. doi:10.1109/5326.983933.
[3] Diehl, C.P., and G. Cauwenberghs. “SVM Incremental Learning, Adaptation and Optimization.” Proceedings of the
International Joint Conference on Neural Networks, 2003. 4, no. x (2003): 2685–90. doi:10.1109/IJCNN.2003.1223991.
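Purely as an illustration of the "incremental training" bullet above (not from the deck), here is a minimal sketch using scikit-learn's partial_fit, which updates a linear classifier from successive mini-batches instead of retraining from scratch; the data and names are made up.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical example: a linear classifier updated incrementally,
# one mini-batch of streamed (features, label) pairs at a time.
rng = np.random.default_rng(0)
clf = SGDClassifier()                         # linear model trained by SGD
classes = np.array([0, 1])                    # must be declared up front for partial_fit

for batch in range(10):                       # e.g. 10 successive windows of a stream
    X = rng.normal(size=(100, 5))             # stand-in for streamed feature vectors
    y = (X[:, 0] + 0.1 * batch > 0).astype(int)   # a slowly drifting decision boundary
    clf.partial_fit(X, y, classes=classes)    # update the model, no full retrain

print("accuracy on the latest batch:", clf.score(X, y))
```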
5
Example: stream analytics
[Diagram: a data stream and background knowledge feed time series analysis, using pattern recognition algorithms (meta-knowledge), to detect temporal patterns, activity, user behaviour, …]
• If the output is stable over time, can I save computation and deliver older outcomes instead?
• How do I quantify the quality/cost trade-offs?
6
Analytics functions and their dependencies can be complex
Y = f(X, D)
X: inputs (vector of arbitrary data structures, "big data")
D: vector of dependencies: libraries, reference data
Y: outputs (vector of arbitrary data structures, "knowledge")
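To make the notation concrete, here is a small illustrative sketch (not part of the original slides) of the Y = f(X, D) view: an execution records its inputs, the versions of its dependencies, and its outputs, so that a later change to any dependency can be detected. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Dependency:
    name: str        # e.g. "ClinVar", "scikit-learn", "HG19"
    version: str     # the version in use when the outcome was produced

@dataclass
class Execution:
    inputs: Dict[str, Any]        # X: the "big data" inputs
    deps: Dict[str, Dependency]   # D: libraries, reference data, platform
    outputs: Dict[str, Any]       # Y: the "knowledge" outcomes

def run(f: Callable[[Dict[str, Any], Dict[str, Dependency]], Dict[str, Any]],
        inputs: Dict[str, Any], deps: Dict[str, Dependency]) -> Execution:
    """Record one execution Y = f(X, D) together with the versions it used."""
    return Execution(inputs, deps, f(inputs, deps))

def outdated(e: Execution, current: Dict[str, Dependency]) -> List[str]:
    """Dependencies whose version has changed since the recorded execution."""
    return [n for n, d in e.deps.items()
            if n in current and current[n].version != d.version]
```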
Ex.: machine learning, using Python and scikit-learn, to learn a model that recognises an activity pattern.
- Inputs: training + testing dataset, config
- Dependencies: scikit-learn, NumPy, Pandas, Python 3, Ubuntu x.y.z on an Azure VM
- Output: model (from model training)

Ex.: workflow to identify mutations in a patient's genome ("analyse input genome").
- Inputs: input genome, config, reference genome, variant DBs
- Dependencies: workflow specification, workflow manager (and its own dependencies), GATK/Picard/BWA, Linux VM cluster on Azure (Ubuntu)
- Output: variants
7
Complex NGS pipelines
[Pipeline diagram: per-sample lanes (raw sequences → align → clean → recalibrate alignments → calculate coverage → coverage information) feed a shared variant-calling stage (call variants → recalibrate variants), followed by per-sample filter variants → annotate → annotated variants; Stages 1–3]
Step annotations:
- Align: aligns the sample sequence to the HG19 reference genome using the BWA aligner
- Clean: cleaning, duplicate elimination (Picard tools)
- Recalibrate alignments: corrects for systematic bias in the quality scores assigned by the sequencer (GATK)
- Calculate coverage: computes the coverage of each read
- Call variants: variant calling operates on multiple samples simultaneously; samples are split into chunks; the haplotype caller detects both SNVs and longer indels
- Recalibrate variants: variant recalibration attempts to reduce the false-positive rate from the caller
- Filter variants: VCF subsetting by filtering, e.g. non-exomic variants
- Annotate: Annovar functional annotations (e.g. MAF, synonymy, SNPs, …), followed by in-house annotations
8
Problem size: HPC vs Cloud deployment
Configuration:
- HPC cluster (dedicated nodes): 3 x 8-core compute nodes (Intel Xeon E5640, 2.67 GHz CPU), 48 GiB RAM, 160 GB scratch space
- Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD
[Chart: response time [hh:mm], 00:00 to 72:00, vs number of samples (0–24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)]
Big Data:
• raw sequences for Whole Exome Sequencing (WES): 5–20GB per patient
• processed in cohorts of 20–40, i.e. close to 1 TB per cohort
• time required to process a 24-sample cohort can easily exceed 2 CPU months
• WES is about 2% of what the Whole Genome Sequencing analyses require
9
Understanding change: threats and opportunities
[Diagram: Big Data feeds Life Sciences Analytics, producing "Valuable Knowledge" versions V1, V2, V3 over time t; the meta-knowledge (algorithms, tools, middleware, reference datasets) also changes over time]
• Threats: Will any of the changes invalidate prior findings?
• Opportunities: Can the findings from the pipelines be improved over time?
• Cost: Need to model future costs based on past history and pricing trends for virtual appliances
• Impact analysis:
• Which patients/samples are likely to be affected?
• How do we estimate the potential benefits on affected patients?
• Can we estimate the impact of these changes without re-computing entire cohorts?
Changes:
• Algorithms and tools
• Accuracy of input sequences
• Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard, …)
10
ReComp
The ReComp loop:
• Observe change: in big data, in meta-knowledge
• Assess and measure: knowledge decay
• Estimate: cost and benefits of refresh
• Enact: reproduce (analytics) processes
[Diagram: the loop sits between the evolving Big Data / Life Sciences Analytics / "Valuable Knowledge" (versions V1, V2, V3 over time t) and the evolving meta-knowledge (algorithms, tools, middleware, reference datasets)]
A decision support system for selectively re-computing complex analytics in reaction to
change
- Generic: not just for the life sciences
- Customisable: eg for genomics pipelines
11
Challenges
1. Observability: to what extent can we observe the process and its execution?
• Process structure
• Data flow → provenance
2. Detecting and quantifying changes:
• In inputs, dependencies, outputs → diff() functions
3. Control: how much control do we have on the system?
• Re-run: how often
• Total vs partial execution
• Input density / resolution / incremental update
• E.g. non-monotonic learning / unlearning
[Diagram: the ReComp Decision Support System takes change events, diff(.,.) functions, "business rules" and a history of past knowledge assets, and produces optimal re-computation prioritisation, impact and cost estimates, and reproducibility assessment]
12
General ReComp problem formulation
13
Change Impact
14
Example: NGS variant interpretation
Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
- E.g. 100K Genome Project, Genomics England, GeCIP
[Pipeline diagram: the same three-stage NGS variant-calling pipeline shown earlier (align → clean → recalibrate alignments → calculate coverage; call variants → recalibrate variants; filter variants → annotate → annotated variants)]
Variant interpretation can help to confirm/reject a hypothesis about the patient's phenotype.
It classifies variants into three categories: RED (pathogenic), GREEN (benign) and AMBER (unknown/uncertain).
Also: metagenomics: species identification, e.g. the EBI Metagenomics portal.
15
The SVI example
16
Change in variant interpretation
What changes:
- Improved sequencing / variant calling
- ClinVar, OMIM evolve rapidly
- New reference data sources
17
ReComp Problem Statement
1. Estimate the impact of changes
2. Optimise ReComp decisions: select the subset of the population that maximises expected impact, subject to a budget constraint
Problem: P is computationally expensive
18
Estimators: formalisation and a possible approach
Problem: f() is computationally expensive.
Approach: learn an approximation f'() of f(): a surrogate (emulator).
Sensitivity analysis: given changes to the inputs (and local changes to the dependencies), assess the resulting change in the output, modelled as y = f'(x) + ε, where ε is a stochastic term that accounts for the error in approximating f, and is typically assumed to be Gaussian.
Learning f'() requires a training set { (xi, yi) } …
If f'() can be found, then we can hope to use it to approximate the impact of a change, which can then be used to carry out sensitivity analysis.
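Below is a purely illustrative sketch of the surrogate idea, assuming a scikit-learn Gaussian process regressor as the emulator f'(); the features, data and interpretation are invented, and this is not the project's actual estimator.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical training set: each row describes a past change
# (e.g. number of added/removed reference records, input size, ...)
# and y is the observed impact of re-computing (e.g. change in the output).
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 1, size=(50, 3))                    # change features
y_train = X_train[:, 0] * 2 + rng.normal(0, 0.05, size=50)   # observed impact

surrogate = GaussianProcessRegressor().fit(X_train, y_train)  # f'() approximating f()

# For a new change, predict its impact (with uncertainty) without running f.
x_new = np.array([[0.8, 0.1, 0.3]])
mean, std = surrogate.predict(x_new, return_std=True)
print(f"estimated impact {mean[0]:.2f} +/- {std[0]:.2f}")
```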
19
Scope of change
1. Change: may affect a subset of the patient population → scope; which patients will be affected?
2. Change: affects a single patient → partial re-run
20
Challenge 1: battleships
Patient / change impact matrix
First challenge: precisely identify the scope of a change.
Blind reaction to change: recompute the entire matrix.
Can we do better?
- Hit the high-impact cases (the X) without re-computing the entire matrix
21
SVI process: detailed design
[Diagram: SVI pipeline: phenotype to genes → variant selection → variant classification; inputs: phenotype hypothesis, patient variants; reference data: GeneMap, ClinVar; output: classified variants]
22
Baseline: Blind recomputation
17 minutes / patient (single-core VM)
Runtime consistent across different phenotypes
Changes to GeneMap/ClinVar have negligible impact on the execution time

Run time [mm:ss] (μ ± σ) by GeneMap version:
  2016-03-08: 17:05 ± 22
  2016-04-28: 17:09 ± 15
  2016-06-07: 17:10 ± 17
23
Inside a single instance: Partial re-computation
[Diagram: the fragments of SVI that must be re-executed after a change in ClinVar vs after a change in GeneMap]
24
White-box granular provenance
[Diagram: provenance graph of process P relating inputs x11, x12 and dependencies D11, D12 to output y11]
- Using provenance metadata to identify fragments of SVI that
are affected by the change in reference data
26
Results
Run time [mm:ss] (μ ± σ) and savings, by GeneMap version:
  2016-04-28: 11:51 ± 16 (31% saving)
  2016-06-07: 11:50 ± 20 (31% saving)
Run time [mm:ss] (μ ± σ) and savings, by ClinVar version:
  2016-02: 9:51 ± 14 (43% saving)
  2016-05: 9:50 ± 15 (42% saving)
• How much can we save?
• Process structure
• First usage of reference data
• Overhead: storing interim data required in partial re-execution
• 20–22 MB for GeneMap changes and 2–334 kB for ClinVar changes
27
Partial re-computation using input difference
Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff:
Q(CV) → Q(diff(CV1, CV2))
This works for SVI, but is hard to generalise: it depends on the type of process.
Bigger gain: diff(CV1, CV2) is much smaller than CV2 (a sketch of the idea follows the tables below).
GeneMap versions (from → to): ToVersion rec. count | Difference rec. count | Reduction
  16-03-08 → 16-06-07: 15910 | 1458 | 91%
  16-03-08 → 16-04-28: 15871 | 1386 | 91%
  16-04-28 → 16-06-01: 15897 | 78 | 99.5%
  16-06-01 → 16-06-02: 15897 | 2 | 99.99%
  16-06-02 → 16-06-07: 15910 | 33 | 99.8%

ClinVar versions (from → to): ToVersion rec. count | Difference rec. count | Reduction
  15-02 → 16-05: 290815 | 38216 | 87%
  15-02 → 16-02: 285042 | 35550 | 88%
  16-02 → 16-05: 290815 | 3322 | 98.9%
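A minimal sketch of the diff-query idea above (not the SVI implementation), assuming reference data keyed by record id: compute the record-level diff between two versions and run the lookup only over the changed records, so the cost scales with the diff rather than with the whole database. Names and data are hypothetical.

```python
def diff(old: dict, new: dict) -> dict:
    """Record-level diff of two reference-data versions keyed by record id:
    records that are new, or whose content changed, between versions."""
    return {k: v for k, v in new.items() if k not in old or old[k] != v}

def query(reference: dict, variants: set) -> dict:
    """Stand-in for an SVI-style lookup: annotate the patient's variants
    with whatever the reference data says about them."""
    return {v: reference[v] for v in variants if v in reference}

# Hypothetical versions of a ClinVar-like table and a patient's variant set
cv1 = {"var1": "benign", "var2": "uncertain"}
cv2 = {"var1": "benign", "var2": "pathogenic", "var3": "benign"}
patient = {"var2", "var3"}

full = query(cv2, patient)                 # baseline: query the whole new version
partial = query(diff(cv1, cv2), patient)   # ReComp idea: query only the diff
print(full, partial)                       # the diff query touches far fewer records
```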
29
Saving resources on stream processing
[Diagram: baseline stream processing: the raw stream x1, x2, … is split into windows W1, W2, …, each processed by P to produce y1, y2, …. Conditional stream processing: for each new window Wi a comp/noComp decision either runs P to produce yi, or re-delivers an earlier output yi-h (h < i) as y'i]
- If we could predict that yi+1 will be similar to yi, we could skip computing P(Wi+1), save resources and instead deliver yi again
- Can we make optimal comp/noComp decisions? What is required?
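The sketch below (illustrative only; the process P, the drift predictor and the threshold are all made up) shows the shape of such a comp/noComp controller: each window is either processed or answered with the most recent output, depending on a predicted drift value.

```python
from typing import Callable, Iterable, List

def conditional_stream(windows: Iterable[list],
                       P: Callable[[list], float],
                       predict_drift: Callable[[List[float]], float],
                       threshold: float) -> List[float]:
    """Process a stream window by window; skip P when predicted drift is low."""
    outputs: List[float] = []
    for w in windows:
        if not outputs or predict_drift(outputs) > threshold:
            outputs.append(P(w))          # comp: pay for a fresh result
        else:
            outputs.append(outputs[-1])   # noComp: re-deliver the last result
    return outputs

# Toy usage: P is a mean; the drift predictor is just the last observed change
windows = [[1, 2, 3], [2, 3, 4], [2, 3, 4], [9, 9, 9]]
P = lambda w: sum(w) / len(w)
predict_drift = lambda ys: abs(ys[-1] - ys[-2]) if len(ys) > 1 else 1.0
print(conditional_stream(windows, P, predict_drift, threshold=0.1))
```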
30
Diff and currency functions
The quality of yi is initially maximal, and decreases over time in a way that depends on how rapidly the new values yj diverge from yi.
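The formal definitions on this slide are not in the transcript; one plausible way to write such a currency function down (an assumption, not the deck's exact definition) is:

```latex
% currency of a past outcome y_i at a later window j >= i:
% maximal (1) at j = i, decreasing as later outputs diverge from y_i
\mathrm{currency}(y_i, j) \;=\; 1 - \mathrm{diff}(y_i, y_j),
\qquad \mathrm{diff}(y_i, y_i) = 0, \quad 0 \le \mathrm{diff} \le 1 .
```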
31
Measuring DeComp performance
Evaluating the performance of comp/noComp decisions on each window.
Cost, boundary cases:
- A very conservative DeComp computes every value
- A very optimistic one computes only the first value
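The cost expressions on this slide are images in the original; assuming each execution of P has a constant cost c over n windows, the boundary cases can be written as:

```latex
% boundary cases (assumed formulation: constant per-window cost c, n windows)
\mathrm{Cost}_{\text{conservative}} = n \cdot c
  \quad \text{(compute every window; no staleness)}
\qquad
\mathrm{Cost}_{\text{optimistic}} = c
  \quad \text{(compute only the first window; later outputs may be stale)}
```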
32
Diff time series
33
Forecasting drift
[Diagram: the outputs yi form a derived diff time series; drift forecasting over that series drives the comp/noComp decision for the next window Wi+1, either running P or re-delivering yi-h (h < i) as y'i]
34
Initial experiments: the DEBS’15 Taxi routes challenge
Find the most frequent / most profitable taxi routes in Manhattan
within each 30’ window
VehicId,LicId, Pickup date, Drop off date, Dur,Dist,PickupLon,PickupLat,DropoffLon,DropofLat,Pay,Fare$, ...
0729...,E775...,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH, 3.50, ...
22D7...,3FF2...,2013-01-01 00:02:00,2013-01-01 00:02:00, 0,0.00, 0.000000, 0.000000, 0.000000, 0.000000,CSH,27.00, ...
0EC2...,778C...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.71,-73.973145,40.752827,-73.965897,40.760445,CSH, 4.00, ...
1390...,BE31...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.48,-74.004173,40.720947,-74.003838,40.726189,CSH, 4.00, ...
3B41...,7077...,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.61,-73.987373,40.724861,-73.983772,40.730995,CRD, 4.00, ...
5FAA...,00B7...,2013-01-01 00:02:00,2013-01-01 00:03:00, 60,0.00, 0.000000, 0.000000, 0.000000, 0.000000,CRD, 2.50, ...
DFBF...,CF86...,2013-01-01 00:02:00,2013-01-01 00:03:00, 60,0.39,-73.981544,40.781475,-73.979439,40.784386,CRD, 3.00, ...
1E5F...,E0B2...,2013-01-01 00:03:00,2013-01-01 00:04:00, 60,0.00,-73.993973,40.751266, 0.000000, 0.000000,CSH, 2.50, ...
4682...,BB89...,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.71,-73.955383,40.779728,-73.967758,40.760326,CSH, 6.50, ...
5F78...,B756...,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.21,-73.973000,40.793140,-73.981453,40.778465,CRD, 6.00, ...
6BA2...,ED36...,2013-01-01 00:01:00,2013-01-01 00:04:00,180,0.74,-73.971138,40.758980,-73.972206,40.752502,CRD, 4.50, ...
75C9...,00B7...,2013-01-01 00:03:00,2013-01-01 00:04:00, 60,0.00, 0.000000, 0.000000, 0.000000, 0.000000,CRD, 3.00, ...
C306...,E255...,2013-01-01 00:01:00,2013-01-01 00:04:00,180,0.84,-73.942841,40.797031,-73.934540,40.797314,CSH, 4.50, ...
C4D6...,95B5...,2013-01-01 00:03:00,2013-01-01 00:04:00, 60,0.00,-73.989189,40.721924, 0.000000, 0.000000,CSH, 2.50, ...
35
Diff time series – taxi routes
[Diagram: the raw taxi trip stream (trip sti with end time fti, pickup cell → dropoff cell) is mapped to a routes time series (fti, Ri); windowing (W1, W2, W3, …) over the routes time series yields a top-k time series of route frequencies (Ri → Freqi) per window]
36
Routes drift – comparing ranked lists
[1] Fagin, Ronald, Ravi Kumar, and D. Sivakumar. “Comparing Top K Lists.” SIAM Journal on
Discrete Mathematics 17, no. 1 (January 2003): 134–60. doi:10.1137/S0895480102412856.
P outputs a list of the top-k most frequent / most profitable routes.
To compare lists we use the generalised Kendall's tau (Fagin et al. [1]); a sketch follows below.
It quantifies how much the top-k changes between one window and the next.
Input parameters determine stability / sensitivity:
- K: how many routes
- window size (e.g. 30')
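As an illustration of the measure (a sketch in the spirit of Fagin et al.'s K^(p) distance [1], not the project's code), the function below compares two top-k lists and returns a penalty normalised to [0, 1]; p is the penalty assigned to pairs seen in only one of the lists.

```python
from itertools import combinations

def kendall_topk(l1: list, l2: list, p: float = 0.5) -> float:
    """Generalised Kendall's tau distance between two top-k lists
    (after Fagin, Kumar, Sivakumar 2003), normalised by the number of pairs."""
    r1 = {item: r for r, item in enumerate(l1)}
    r2 = {item: r for r, item in enumerate(l2)}
    penalty, pairs = 0.0, 0
    for i, j in combinations(set(r1) | set(r2), 2):
        pairs += 1
        in1, in2 = (i in r1) + (j in r1), (i in r2) + (j in r2)
        if in1 == 2 and in2 == 2:
            # both items ranked in both lists: penalise discordant order
            penalty += (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
        elif in1 == 2 and in2 == 1:
            # both in l1, one in l2: the absent item is implicitly ranked lower in l2
            present = i if i in r2 else j
            absent = j if present == i else i
            penalty += r1[absent] < r1[present]
        elif in2 == 2 and in1 == 1:
            present = i if i in r1 else j
            absent = j if present == i else i
            penalty += r2[absent] < r2[present]
        elif in1 == 2 or in2 == 2:
            # both items appear in one list only: partial information, penalty p
            penalty += p
        else:
            # each item appears in a different list: certain disagreement
            penalty += 1
    return penalty / pairs if pairs else 0.0

# e.g. drift between two consecutive windows' top-3 route lists
print(kendall_topk(["R1", "R2", "R3"], ["R2", "R1", "R4"]))
```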
37
Drift function plots (top-10, date range [1/Jan 00:00–15/Jan 00:00)):
- window size: 2h
- window size: 1h
- window size: 30m
[Plots: normalised drift per window, y-axis 0–1.2]
38
Drift function plots (window size: 1h, date range [1/Jan 00:00–15/Jan 00:00)):
- top-40
- top-20
- top-10
[Plots: normalised drift per window, y-axis 0–1.2]
39
Approach: ARIMA forecasting
[Plot: actual normalised drift vs ARIMA(1,0,2)[1,0,1] forecast, with new-day markers. Drift function: top-10, window size = 1h, date range = [20/Jan 00:00–25/Jan 17:00)]
Drift prediction using time series forecasting
• This is the derived diff() time series!
• Autoregressive integrated moving average (ARIMA)
• Widely used and well understood, well supported
• Fast to compute
• Assumes normality of underlying random variable
Poor prediction: compute P too often or too rarely
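A minimal, illustrative sketch of the forecasting step, assuming the statsmodels library and a synthetic drift series; the non-seasonal (1,0,2) order is taken from the plot legend above, and the decision threshold is invented.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for the derived diff()/drift time series (one value per window)
rng = np.random.default_rng(42)
drift = 0.5 + 0.2 * np.sin(np.arange(200) / 10) + rng.normal(0, 0.05, 200)

model = ARIMA(drift, order=(1, 0, 2)).fit()   # ARIMA(p=1, d=0, q=2)
next_drift = model.forecast(steps=1)[0]       # forecast drift of the next window

THRESHOLD = 0.6   # hypothetical decision threshold
decision = "comp" if next_drift > THRESHOLD else "noComp"
print(f"forecast drift {next_drift:.2f} -> {decision}")
```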
40
The next steps -- challenges
• Can we learn effective surrogate models and estimators of change
impact?
• diff() functions, estimators seem very problem-specific
• To what extent can the ReComp framework be made generic,
reusable, yet still useful?
• Metadata infrastructure: a DB of past execution history
• Reproducibility: What really happens when I press the “ReComp”
button?
41
Summary and challenges
Forwards: react to changes in data used by processes. Backwards: restore the value of knowledge outcomes.
[Diagram: the ReComp meta-process loop:
- Monitor data changes (input and reference data versioning; data change events; new ground truth)
- Quantify data changes, then estimate the impact of changes and the cost of refresh
- Quantify knowledge decay, then estimate the benefit and the cost of refresh
- Optimise / prioritise outcomes (drawing on knowledge outcomes, provenance and cost)
- Re-compute selected outcomes]
ReComp:
a meta-process to observe and control underlying analytics processes
42
ReComp scenarios
Scenarios (target impact areas, why ReComp is relevant, proof-of-concept experiments, expected optimisation):

1. Dataflow, experimental science
   - Target impact areas: genomics
   - Why is ReComp relevant? Rapid knowledge advances; rapid scaling up of genetic testing at population level
   - Proof-of-concept experiments: WES/SVI pipeline, workflow implementation (eScience Central)
   - Expected optimisation: timeliness and accuracy of patient diagnosis, subject to budget constraints

2. Time series analysis
   - Target impact areas: personal health monitoring; smart city analytics; IoT data streams
   - Why is ReComp relevant? Rapid data drift; cost of computation at the network edge (e.g. IoT)
   - Proof-of-concept experiments: NYC taxi rides challenge (DEBS'15)
   - Expected optimisation: use of low-power edge devices when the outcome is predictable and data drift is low

3. Data layer optimisation
   - Target impact areas: tuning of a large-scale data management stack
   - Why is ReComp relevant? Optimal data organisation is sensitive to current data profiles
   - Proof-of-concept experiments: graph DB re-partitioning
   - Expected optimisation: system throughput vs cost of re-tuning

4. Model learning
   - Target impact areas: applications of predictive analytics
   - Why is ReComp relevant? Predictive models are very sensitive to data drift
   - Proof-of-concept experiments: Twitter content analysis
   - Expected optimisation: sustained model predictive power over time vs retraining cost

5. Simulation
   - Target impact areas: TBD
   - Why is ReComp relevant? Repeated simulation is computationally expensive but often not beneficial
   - Proof-of-concept experiments: flood modelling / CityCat Newcastle
   - Expected optimisation: computational resources vs marginal benefit of a new simulation model
43
Observability / transparency
White box vs black box:

Structure (static view):
- White box: dataflow (eScience Central, Taverna, VisTrails, …); scripting (R, Matlab, Python, …)
- Black box: function semantics; packaged components; third-party services

Data dependencies (runtime view):
- White box: provenance recording of inputs, reference datasets, component versions, outputs
- Black box: inputs and outputs only; no data dependencies; no details on individual components

Cost:
- White box: detailed resource monitoring; cloud → £££
- Black box: wall-clock time; service pricing; setup time (e.g. model learning)
44
Project structure
• 3 years funding from the EPSRC (£585,000 grant) on the Making Sense from Data call
• Feb. 2016 - Jan. 2019
• 2 RAs fully employed in Newcastle
• PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
• Co-Investigators (8% each):
• Prof. Watson, School of Computing Science, Newcastle University
• Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
• Dr. Phil James, Civil Engineering, Newcastle University
Builds upon the experience of the Cloud-e-Genome project: 2013-2015
Aims:
- To demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud
- To facilitate the adoption of reliable genetic testing in clinical practice
- A collaboration between the Institute of Genetic Medicine and the School of Computing
Science at Newcastle University
- Funding: NIHR / Newcastle BRC (£180,000) plus $40,000 Microsoft Research grant “Azure
for Research”
  • 4. ReComp–KeeleUniversity Dec.2016–P.Missier 4 Example: supervised learning Meta-knowledge Training set Model learning Classification algorithms Predictive classifier Background Knowledge (prior)  the training set is no longer representative of current data  the model loses predictive power Ex.: training set is a sample from social media stream (Twitter, Instagram, …) • Incremental training: established (neural networks, Bayes classifiers, …) • Incremental unlearning: some established work [1] t [1] Kidera, Takuya, Seiichi Ozawa, Shigeo Abe. “An Incremental Learning Algorithm of Ensemble Classifier Systems.” Neural Networks, 2006, 6453–59. doi:10.1109/IJCNN.2006.247345. [2] Polikar, R., L. Upda, S.S. Upda, and V. Honavar. “Learn++: An Incremental Learning Algorithm for Supervised Neural Networks.” IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31, no. 4 (2001): 497– 508. doi:10.1109/5326.983933. [3] Diehl, C.P., and G. Cauwenberghs. “SVM Incremental Learning, Adaptation and Optimization.” Proceedings of the International Joint Conference on Neural Networks, 2003. 4, no. x (2003): 2685–90. doi:10.1109/IJCNN.2003.1223991.
  • 5. ReComp–KeeleUniversity Dec.2016–P.Missier 5 Example: stream Analytics Meta-knowledge Data stream Time Series analysis Pattern recognition algorithms - Temporal Patterns - Activity detection - User behaviour - … Background Knowledge • If the output is stable over time, can I save computation and deliver older outcomes instead? • How do I quantify the quality/ cost trade-offs?
  • 6. ReComp–KeeleUniversity Dec.2016–P.Missier 6 Analytics functions and their dependencies can be complex Y = f(X, D) X inputs (vector of arbitrary data structures, “big data”) D: vector of dependencies: libraries, reference data Y outputs (vector of arbitrary data structures, “knowledge”) Ex.: machine learning Using Python and scikit-learn Learn model to recognise activity pattern Python 3 Ubuntu x.y.z Azure VM Model training Model Scikit-learn Numpy Pandas Ubuntu on Azure Dependencies Training + Testing dataset config Ex.: workflow to Identify mutations in a patient’s genome Workflow specification WF manager Linux VM cluster on Azure Analyse Input genome variants GATK/Picard/BWA Workflow Manager (and its own dependencies) Ubuntu on Azure Dep. Input genome config Ref genome Variants DBs
  • 7. ReComp–KeeleUniversity Dec.2016–P.Missier 7 Complex NGS pipelines Recalibration Corrects for system bias on quality scores assigned by sequencer GATK Computes coverage of each read. VCF Subsetting by filtering, eg non-exomic variants Annovar functional annotations (eg MAF, synonimity, SNPs…) followed by in house annotations Aligns sample sequence to HG19 reference genome using BWA aligner Cleaning, duplicate elimination Picard tools Variant calling operates on multiple samples simultaneously Splits samples into chunks. Haplotype caller detects both SNV as well as longer indels Variant recalibration attempts to reduce false positive rate from caller raw sequences align clean recalibrate alignments calculate coverage call variants recalibrate variants filter variants annotate coverage information annotated variants raw sequences align clean recalibrate alignments calculate coverage coverage informationraw sequences align clean calculate coverage coverage information recalibrate alignments annotate annotated variants annotate annotated variants Stage 1 Stage 2 Stage 3 filter variants filter variants
  • 8. Problem size: HPC vs Cloud deployment
    Configuration:
    - HPC cluster (dedicated nodes): 3x 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
    - Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD
    [Chart: response time [hh:mm] vs. number of samples (0-24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)]
    Big Data:
    - raw sequences for Whole Exome Sequencing (WES): 5-20 GB per patient
    - processed in cohorts of 20-40, i.e. close to 1 TB per cohort
    - the time required to process a 24-sample cohort can easily exceed 2 CPU-months
    - WES is about 2% of what Whole Genome Sequencing analyses require
  • 9. Understanding change: threats and opportunities
    [Diagram: Big Data feeds Life Sciences Analytics producing "Valuable Knowledge" (versions V1, V2, V3 over time), alongside evolving meta-knowledge: algorithms, tools, middleware, reference datasets]
    Changes:
    • Algorithms and tools
    • Accuracy of input sequences
    • Reference databases (HGMD, ClinVar, OMIM GeneMap, GeneCard, ...)
    Questions:
    • Threats: will any of the changes invalidate prior findings?
    • Opportunities: can the findings from the pipelines be improved over time?
    • Cost: need to model future costs based on past history and pricing trends for virtual appliances
    • Impact analysis:
      • Which patients/samples are likely to be affected?
      • How do we estimate the potential benefits for affected patients?
      • Can we estimate the impact of these changes without re-computing entire cohorts?
  • 10. ReComp
    A decision support system for selectively re-computing complex analytics in reaction to change:
    - Generic: not just for the life sciences
    - Customisable: e.g. for genomics pipelines
    [Diagram: the ReComp cycle: observe change (in big data, in meta-knowledge), assess and measure (knowledge decay), estimate (cost and benefits of refresh), enact (reproduce the analytics processes)]
  • 11. Challenges
    1. Observability: to what extent can we observe the process and its execution?
       • Process structure
       • Data flow → provenance
    2. Detecting and quantifying changes:
       • in inputs, dependencies, outputs → diff() functions
    3. Control: how much control do we have on the system?
       • Re-run: how often?
       • Total vs partial execution
       • Input density / resolution / incremental update
       • E.g. non-monotonic learning / unlearning
    [Diagram: the ReComp Decision Support System consumes change events, diff(.,.) functions, "business rules", impact and cost estimates, reproducibility assessments and a history of past knowledge assets, and produces an optimal re-computation prioritisation]
  • 14. Example: NGS variant interpretation
    Genomics: WES / WGS, variant calling, variant interpretation → diagnosis
    - E.g. 100K Genome Project, Genomics England, GeCIP
    [Pipeline diagram (three stages, one lane per sample): align → clean → recalibrate alignments → calculate coverage → call variants → recalibrate variants → filter variants → annotate, producing coverage information and annotated variants]
    Variant interpretation can help to confirm or reject a hypothesis about the patient's phenotype. It classifies variants into three categories: RED (pathogenic), AMBER (unknown/uncertain), GREEN (benign).
    Also: metagenomics, i.e. species identification, e.g. the EBI metagenomics portal
  • 16. Change in variant interpretation
    What changes:
    - Improved sequencing / variant calling
    - ClinVar and OMIM evolve rapidly
    - New reference data sources
  • 17. ReComp problem statement
    1. Estimate the impact of changes
    2. Optimise ReComp decisions: select the subset of the population that maximises expected impact, subject to a budget constraint
    Problem: P is computationally expensive. A sketch of the selection step follows below.
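    A minimal sketch of that selection step, assuming per-outcome impact and cost estimates are already available (the list layout and the numbers are purely illustrative, not the project's implementation): a greedy heuristic that prioritises outcomes by estimated impact per unit cost until the budget C is exhausted.

        # Greedy heuristic for the ReComp selection problem:
        # choose a subset of outcomes that maximises total estimated impact
        # subject to a re-computation budget C. Illustrative sketch only.

        def select_for_recomp(outcomes, budget):
            """outcomes: list of (outcome_id, estimated_impact, estimated_cost)."""
            # Rank by estimated impact per unit cost, highest first
            ranked = sorted(outcomes, key=lambda o: o[1] / o[2], reverse=True)
            selected, spent = [], 0.0
            for outcome_id, impact, cost in ranked:
                if spent + cost <= budget:
                    selected.append(outcome_id)
                    spent += cost
            return selected, spent

        # Example: three patients' outcomes with estimated impact of a ClinVar update
        patients = [("p1", 0.9, 17.0), ("p2", 0.1, 17.0), ("p3", 0.6, 12.0)]
        print(select_for_recomp(patients, budget=30.0))   # (['p1', 'p3'], 29.0)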
  • 18. Estimators: formalisation and a possible approach
    Given (for simplicity) y = f(x) and a change x → x' (and local changes), assess diff_Y(f(x), f(x')).
    Problem: f() is computationally expensive.
    Approach (sensitivity analysis): learn an approximation f'() of f(), a surrogate (emulator), such that f'(x) = f(x) + ε, where ε is a stochastic term that accounts for the error in approximating f and is typically assumed to be Gaussian.
    Learning f'() requires a training set { (x_i, y_i) }. If f'() can be found, then we can hope to use it to approximate diff_Y(f(x), f(x')) with diff_Y(f'(x), f'(x')), which can then be used to carry out sensitivity analysis. A sketch of this idea follows below.
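    One way to realise the emulator idea with the Python/scikit-learn stack mentioned earlier in the deck. This is only a sketch under the assumption that past executions { (x_i, y_i) } can be flattened into numeric feature vectors; the synthetic data stands in for a real execution history, and a Gaussian process regressor is just one possible choice of surrogate f'.

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor

        # Training set collected from past executions of the expensive function f:
        # X_hist are (flattened, numeric) inputs, y_hist the corresponding outputs.
        rng = np.random.default_rng(0)
        X_hist = rng.uniform(0, 1, size=(50, 3))
        y_hist = X_hist.sum(axis=1) + rng.normal(0, 0.05, size=50)  # stand-in for f(x) + noise

        # f'(): the surrogate (emulator) of f()
        surrogate = GaussianProcessRegressor().fit(X_hist, y_hist)

        def estimated_output_change(x_old, x_new):
            """Approximate diff_Y(f(x), f(x')) by |f'(x') - f'(x)|, avoiding two runs of f."""
            y_old, y_new = surrogate.predict(np.vstack([x_old, x_new]))
            return abs(y_new - y_old)

        x, x_changed = X_hist[0], X_hist[0] + np.array([0.0, 0.2, 0.0])
        print(estimated_output_change(x, x_changed))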
  • 19. Scope of change
    1. Change: may affect a subset of the patient population → scope. Which patients will be affected?
    2. Change: affects a single patient → partial re-run
  • 20. Challenge 1: battleships
    [Patient / change impact matrix, with X marking the high-impact cases]
    First challenge: precisely identify the scope of a change.
    - Blind reaction to change: recompute the entire matrix
    - Can we do better? Hit the high-impact cases (the X) without re-computing the entire matrix
  • 21. SVI process: detailed design
    [Dataflow: patient variants and a phenotype hypothesis feed three steps, phenotype-to-genes (using GeneMap), variant selection, and variant classification (using ClinVar), producing classified variants]
  • 22. Baseline: blind recomputation
    - 17 minutes / patient (single-core VM)
    - Runtime is consistent across different phenotypes
    - Changes to GeneMap/ClinVar have negligible impact on the execution time
    GeneMap version:           2016-03-08    2016-04-28    2016-06-07
    Run time μ ± σ [mm:ss]:    17:05 ± 22    17:09 ± 15    17:10 ± 17
  • 23. Inside a single instance: partial re-computation
    [Diagram: the fragments of the SVI process affected by a change in ClinVar vs. by a change in GeneMap]
  • 24. White-box, granular provenance
    - Using provenance metadata to identify fragments of SVI that are affected by the change in reference data (a sketch follows below)
    [Diagram: inputs x11, x12 and dependencies D11, D12 used by process P to derive output y11]
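    A sketch of the scoping idea: given the set of reference records each prior outcome actually used (recorded as PROV 'used' relations with role 'dep', per the editor's notes) and the diff of a reference dataset, return the outcomes whose provenance mentions a changed record. The record identifiers and dictionary layout here are illustrative.

        # Use recorded provenance to scope a reference-data change.
        # prov maps each prior outcome to the set of reference records it 'used'.

        def affected_outcomes(prov, changed_record_ids):
            """Return outcomes whose provenance intersects the set of changed records."""
            changed = set(changed_record_ids)
            return [outcome for outcome, used_records in prov.items()
                    if used_records & changed]

        prov = {
            "patient_1": {"ClinVar:rs123", "GeneMap:CMS"},
            "patient_2": {"ClinVar:rs999"},
        }
        # diff(ClinVar_t, ClinVar_t') produced one updated record:
        print(affected_outcomes(prov, ["ClinVar:rs123"]))   # ['patient_1']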
  • 25. Results
    GeneMap version:           2016-04-28    2016-06-07
    Run time μ ± σ [mm:ss]:    11:51 ± 16    11:50 ± 20
    Savings:                   31%           31%

    ClinVar version:           2016-02       2016-05
    Run time μ ± σ [mm:ss]:    9:51 ± 14     9:50 ± 15
    Savings:                   43%           42%

    How much can we save? It depends on:
    • the process structure
    • where the reference data is first used
    Overhead: storing the interim data required for partial re-execution
    • 20-22 MB for GeneMap changes and 2-334 kB for ClinVar changes
  • 26. Partial re-computation using input difference
    Idea: run SVI but replace the ClinVar query with a query on the ClinVar version diff: Q(CV) → Q(diff(CV1, CV2))
    - Bigger gain: diff(CV1, CV2) is much smaller than CV2
    - Works for SVI, but hard to generalise: it depends on the type of process
    A sketch of how such version diffs can be computed follows below.

    GeneMap versions (from → to)    To-version rec. count    Difference rec. count    Reduction
    16-03-08 → 16-06-07             15910                    1458                     91%
    16-03-08 → 16-04-28             15871                    1386                     91%
    16-04-28 → 16-06-01             15897                    78                       99.5%
    16-06-01 → 16-06-02             15897                    2                        99.99%
    16-06-02 → 16-06-07             15910                    33                       99.8%

    ClinVar versions (from → to)    To-version rec. count    Difference rec. count    Reduction
    15-02 → 16-05                   290815                   38216                    87%
    15-02 → 16-02                   285042                   35550                    88%
    16-02 → 16-05                   290815                   3322                     98.9%
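    A sketch of building diff(CV1, CV2) with pandas, assuming each reference-data version can be read as a table keyed by variant with a status column (the column names and toy records are illustrative, not ClinVar's actual schema). The downstream query is then run on the much smaller diff table.

        import pandas as pd

        def reference_diff(old, new, key="variant", value="status"):
            """Records added, removed, or whose value changed between two versions."""
            merged = old.merge(new, on=key, how="outer",
                               suffixes=("_old", "_new"), indicator=True)
            added   = merged["_merge"] == "right_only"
            removed = merged["_merge"] == "left_only"
            changed = (merged["_merge"] == "both") & \
                      (merged[f"{value}_old"] != merged[f"{value}_new"])
            return merged[added | removed | changed]

        cv_old = pd.DataFrame({"variant": ["rs1", "rs2"], "status": ["benign", "unknown"]})
        cv_new = pd.DataFrame({"variant": ["rs1", "rs2", "rs3"],
                               "status": ["benign", "pathogenic", "unknown"]})
        diff = reference_diff(cv_old, cv_new)
        print(diff[["variant", "status_old", "status_new"]])   # rs2 changed, rs3 added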
  • 27. Saving resources on stream processing
    [Diagram: a raw stream split into windows W1, W2, ...; baseline processing computes y_i = P(W_i) for every window, while conditional processing makes a comp/noComp decision per window and may re-deliver an earlier output y_{i-h} instead]
    - If we could predict that y_{i+1} will be similar to y_i, we could skip computing P(W_{i+1}), save resources and deliver y_i again instead
    - Can we make optimal comp/noComp decisions? What is required? A minimal comp/noComp loop is sketched below.
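    A minimal sketch of the conditional processing loop, assuming a drift predictor and a threshold are available (both are stand-ins here; the slide that follows discusses how the prediction could actually be made).

        # Conditional stream processing: recompute P(W_i) only when the predicted
        # drift since the last computed output exceeds a threshold; otherwise
        # re-deliver the last computed output.

        def conditional_stream(windows, P, predict_drift, threshold=0.3):
            last_output, delivered = None, []
            for i, window in enumerate(windows):
                if last_output is None or predict_drift(i) > threshold:
                    last_output = P(window)       # comp
                delivered.append(last_output)     # deliver fresh or reused output
            return delivered

        # Toy example: P averages each window; drift is predicted high every 3rd window.
        windows = [[1, 2], [2, 3], [10, 12], [11, 12]]
        outputs = conditional_stream(windows, P=lambda w: sum(w) / len(w),
                                     predict_drift=lambda i: 1.0 if i % 3 == 0 else 0.0)
        print(outputs)   # [1.5, 1.5, 1.5, 11.5]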
  • 28. Diff and currency functions
    The quality of y_i is initially maximal and decreases over time, in a way that depends on how rapidly the new values y_j diverge from y_i (see the small example below).
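    A direct transcription of the currency function given in the editor's notes (q(y_i, j) = q_max when i = j, otherwise q_max - |diff_O(y_i, y_j)|). The numeric diff_O and the sample outputs are purely illustrative.

        # Currency of an old output y_i at a later time j (from the editor's notes).
        Q_MAX = 1.0

        def diff_o(y_a, y_b):
            return min(abs(y_a - y_b), Q_MAX)   # toy diff_O, maps into [0, 1]

        def currency(y_i, y_j, same_time):
            return Q_MAX if same_time else Q_MAX - diff_o(y_i, y_j)

        y = [0.50, 0.52, 0.58, 0.80]            # successive outputs y_1 ... y_N
        print([currency(y[0], yj, j == 0) for j, yj in enumerate(y)])
        # approximately [1.0, 0.98, 0.92, 0.7]: the quality of y_1 decays as later outputs diverge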
  • 29. Measuring DeComp performance
    Evaluating the performance of the comp/noComp decisions made on each window, relative to their cost.
    Boundary cases:
    - Very conservative: DeComp computes every value
    - Very optimistic: only the first value is computed
    (A small numeric example follows below; the full formulas are in the editor's notes.)
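    A worked numeric example of the perf(N) measure defined in the editor's notes (sum of delivered currency divided by N times total cost), evaluated at the two boundary strategies. The output values and the unit cost c are illustrative.

        # Evaluate perf(N) for the two boundary comp/noComp strategies.
        Q_MAX, c = 1.0, 1.0

        def perf(delivered_quality, total_cost):
            return sum(delivered_quality) / (len(delivered_quality) * total_cost)

        true_outputs = [0.50, 0.52, 0.58, 0.80]          # y_1 ... y_N
        diff_o = lambda a, b: min(abs(a - b), Q_MAX)
        N = len(true_outputs)

        # Very conservative: compute every value -> quality always Q_MAX, total cost N*c
        print(perf([Q_MAX] * N, N * c))                  # = Q_MAX / (N*c) = 0.25

        # Very optimistic: compute only y_1 and keep delivering it -> total cost c
        stale_quality = [Q_MAX - diff_o(true_outputs[0], y) for y in true_outputs]
        print(perf(stale_quality, c))                    # = (1/c) * [Q_MAX - average drift from y_1]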
  • 31. Forecasting drift
    [Diagram: the comp/noComp decision for window W_{i+1} is driven by time-series forecasting on a derived series of drift values computed from past outputs]
  • 32. Initial experiments: the DEBS’15 taxi routes challenge
    Find the most frequent / most profitable taxi routes in Manhattan within each 30-minute window.
    Input record format:
      VehicId, LicId, Pickup date, Drop off date, Dur, Dist, PickupLon, PickupLat, DropoffLon, DropoffLat, Pay, Fare$, ...
    Sample records:
      0729...,E775...,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH, 3.50, ...
      22D7...,3FF2...,2013-01-01 00:02:00,2013-01-01 00:02:00, 0,0.00, 0.000000, 0.000000, 0.000000, 0.000000,CSH,27.00, ...
      (further sample rows omitted)
  • 33. Diff time series – taxi routes
    [Diagram: the raw data stream (trip records) is mapped to a routes time series (one route R per trip, grouped into windows W1, W2, W3, ...), which in turn is mapped to a top-k time series (a per-window ranked list of routes R1...Rk with their frequencies); a sketch of this per-window derivation follows below]
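    A sketch of deriving a top-k route list per window. The tuple layout of the trips, the grid-cell size and the parameter values are simplified assumptions for illustration, not the challenge's exact definitions.

        from collections import Counter

        # Derive a top-k route list for each 30-minute window of taxi trips.
        # A 'route' is a (pickup cell, dropoff cell) pair on a coarse grid.

        def cell(lon, lat, size=0.005):
            return (round(lon / size), round(lat / size))

        def top_k_routes(trips, k=10, window_secs=1800):
            """trips: iterable of (timestamp_secs, pickup_lon, pickup_lat, drop_lon, drop_lat)."""
            windows = {}
            for t, plon, plat, dlon, dlat in trips:
                w = int(t // window_secs)                      # window index
                route = (cell(plon, plat), cell(dlon, dlat))
                windows.setdefault(w, Counter())[route] += 1
            return {w: [r for r, _ in counts.most_common(k)] for w, counts in windows.items()}

        trips = [(60, -73.95, 40.71, -73.96, 40.71), (120, -73.95, 40.71, -73.96, 40.71),
                 (200, -73.97, 40.75, -73.96, 40.76), (2000, -73.95, 40.71, -73.96, 40.71)]
        print(top_k_routes(trips, k=2))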
  • 34. Routes drift – comparing ranked lists
    P outputs a list of the top-k most frequent / most profitable routes.
    To compare consecutive lists we use the generalised Kendall’s tau (Fagin et al. [1]), which quantifies how much the top-k list changes between one window and the next.
    Input parameters determine stability / sensitivity:
    - k: how many routes
    - window size (e.g. 30 minutes)
    A simplified sketch of the comparison follows below.
    [1] Fagin, Ronald, Ravi Kumar, and D. Sivakumar. “Comparing Top K Lists.” SIAM Journal on Discrete Mathematics 17, no. 1 (January 2003): 134–60. doi:10.1137/S0895480102412856.
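    A simplified reading of the K^(p) idea from Fagin et al.: count disagreeing pairs, charge a penalty p for pairs that one list says nothing about, and scale by the number of pairs. The exact penalty cases and normalisation in the paper differ; this is an illustrative sketch, not the project's implementation.

        from itertools import combinations

        def topk_drift(list_a, list_b, p=0.5):
            """Simplified generalised Kendall's tau distance between two top-k lists,
            scaled by the number of item pairs so the value lies in [0, 1]."""
            rank_a = {x: i for i, x in enumerate(list_a)}
            rank_b = {x: i for i, x in enumerate(list_b)}

            def implied_order(rank, x, y):
                # -1 if x ranked ahead of y, +1 if y ahead of x, 0 if the list says nothing
                if x in rank and y in rank:
                    return -1 if rank[x] < rank[y] else 1
                if x in rank:                       # x listed, y not: x implicitly ahead
                    return -1
                if y in rank:
                    return 1
                return 0

            penalty, pairs = 0.0, 0
            for x, y in combinations(set(rank_a) | set(rank_b), 2):
                pairs += 1
                oa, ob = implied_order(rank_a, x, y), implied_order(rank_b, x, y)
                if oa and ob:
                    penalty += 1.0 if oa != ob else 0.0   # both lists order the pair
                else:
                    penalty += p                          # pair unseen in one list
            return penalty / pairs if pairs else 0.0

        print(topk_drift(["R1", "R2", "R3"], ["R1", "R2", "R3"]))   # 0.0 (no drift)
        print(topk_drift(["R1", "R2", "R3"], ["R2", "R1", "R4"]))   # about 0.33 (swap + churn)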
  • 35. [Plots: drift function over time for top-10 routes, date range [1/Jan 00:00–15/Jan 00:00), at window sizes 2h, 1h and 30m]
  • 36. [Plots: drift function over time, window size 1h, date range [1/Jan 00:00–15/Jan 00:00), for top-40, top-20 and top-10 routes]
  • 37. Approach: ARIMA forecasting
    [Plot: actual normalised drift vs ARIMA(1,0,2)[1,0,1] forecast; drift function for top-10 routes, window size 1h, date range [20/Jan 00:00–25/Jan 17:00), with new-day boundaries marked]
    Drift prediction using time series forecasting:
    • This is the derived diff() time series!
    • Autoregressive integrated moving average (ARIMA)
    • Widely used, well understood and well supported
    • Fast to compute
    • Assumes normality of the underlying random variable
    Poor prediction means computing P too often or too rarely. A sketch of this forecasting step follows below.
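    A sketch of the forecasting step using statsmodels (an assumption; the slide does not name the library actually used). The drift history, threshold and the non-seasonal (1,0,2) order are illustrative; the seasonal part quoted on the slide is omitted here, and order selection is a tuning choice.

        import numpy as np
        from statsmodels.tsa.arima.model import ARIMA

        # Forecast the next value of the derived drift time series; if the forecast
        # exceeds a threshold, schedule a re-computation of P for the next window.
        drift_history = [0.10, 0.12, 0.11, 0.35, 0.15, 0.13, 0.12, 0.40, 0.14, 0.12]

        model = ARIMA(np.asarray(drift_history), order=(1, 0, 2)).fit()
        next_drift = model.forecast(steps=1)[0]

        THRESHOLD = 0.3   # illustrative comp/noComp threshold
        print(f"forecast drift = {next_drift:.3f}",
              "-> recompute" if next_drift > THRESHOLD else "-> reuse previous output")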
  • 38. The next steps -- challenges
    • Can we learn effective surrogate models and estimators of change impact?
    • diff() functions and estimators seem very problem-specific
    • To what extent can the ReComp framework be made generic and reusable, yet still useful?
    • Metadata infrastructure: a database of past execution history
    • Reproducibility: what really happens when I press the “ReComp” button?
  • 39. Summary and challenges
    ReComp: a meta-process to observe and control underlying analytics processes.
    - Forwards: react to changes in the data used by processes. Monitor and quantify data changes, estimate the impact of changes and the cost of refresh, optimise / prioritise, and re-compute selected outcomes.
    - Backwards: restore the value of knowledge outcomes. Quantify knowledge decay, estimate the benefit and cost of refresh.
    Inputs to the meta-process: input and reference data versioning, data change events, new ground truth, provenance and cost of past executions, knowledge outcomes.
  • 40. ReComp scenarios
    1. Dataflow, experimental science
       - Target impact area: genomics
       - Why ReComp is relevant: rapid knowledge advances; rapid scaling up of genetic testing at population level
       - Proof-of-concept experiment: WES/SVI pipeline, workflow implementation (eScience Central)
       - Expected optimisation: timeliness and accuracy of patient diagnosis subject to budget constraints
    2. Time series analysis
       - Target impact areas: personal health monitoring, smart city analytics, IoT data streams
       - Why ReComp is relevant: rapid data drift; cost of computation at the network edge (e.g. IoT)
       - Proof-of-concept experiment: NYC taxi rides challenge (DEBS’15)
       - Expected optimisation: use of low-power edge devices when the outcome is predictable and data drift is low
    3. Data layer optimisation
       - Target impact area: tuning of a large-scale data management stack
       - Why ReComp is relevant: optimal data organisation is sensitive to current data profiles
       - Proof-of-concept experiment: graph DB re-partitioning
       - Expected optimisation: system throughput vs cost of re-tuning
    4. Model learning
       - Target impact area: applications of predictive analytics
       - Why ReComp is relevant: predictive models are very sensitive to data drift
       - Proof-of-concept experiment: Twitter content analysis
       - Expected optimisation: sustained model predictive power over time vs retraining cost
    5. Simulation
       - Target impact area: TBD
       - Why ReComp is relevant: repeated simulation is computationally expensive but often not beneficial
       - Proof-of-concept experiment: flood modelling / CityCat Newcastle
       - Expected optimisation: computational resources vs marginal benefit of a new simulation model
  • 41. Observability / transparency
    Structure (static view):
    - White box: dataflow systems (eScience Central, Taverna, VisTrails, ...); scripting (R, Matlab, Python, ...)
    - Black box: function semantics; packaged components; third-party services
    Data dependencies (runtime view):
    - White box: provenance recording of inputs, reference datasets, component versions, outputs
    - Black box: inputs and outputs only; no data dependencies; no details on individual components
    Cost:
    - White box: detailed resource monitoring; cloud → £££
    - Black box: wall clock time; service pricing; setup time (e.g. model learning)
  • 42. Project structure
    • 3 years of funding from the EPSRC (£585,000 grant) under the Making Sense from Data call
    • Feb. 2016 – Jan. 2019
    • 2 RAs fully employed in Newcastle
    • PI: Dr. Missier, School of Computing Science, Newcastle University (30%)
    • Co-investigators (8% each):
      • Prof. Watson, School of Computing Science, Newcastle University
      • Prof. Chinnery, Department of Clinical Neurosciences, Cambridge University
      • Dr. Phil James, Civil Engineering, Newcastle University
    Builds upon the experience of the Cloud-e-Genome project (2013–2015):
    - A collaboration between the Institute of Genetic Medicine and the School of Computing Science at Newcastle University
    - Funding: NIHR / Newcastle BRC (£180,000) plus a $40,000 Microsoft Research “Azure for Research” grant
    - Aims: to demonstrate cost-effective workflow-based processing of NGS pipelines on the cloud, and to facilitate the adoption of reliable genetic testing in clinical practice

Editor's notes

  1. The times they are a’changin
  2. Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
  3. Program $P$ takes input $x$ and depends on reference data resources $D = \{D_1 \ldots D_m\}$.
     Each execution $i: 1 \dots N$ operates on a version of its input $x_i^t$, on a state $d_j^t$ for each $D_j \in D$, and incurs cost $c_i^t$:
       $\langle y_i^t, c_i^t \rangle = \mathit{exec}(P, x_i^t, \{ d_1^t \dots d_m^t \})$
     Data version changes:
     - inputs: $x_i^t \rightarrow x_i^{t'}$
     - dependencies: $d_j^t \rightarrow d_j^{t'}$, a new release of $D_j$ at time $t'$
     Diff functions:
     - $\mathit{diff}_X(x_i^t, x_i^{t'})$
     - $\mathit{diff}_Y(y_i^t, y_i^{t'})$
     - $\mathit{diff}_{D_j}(d_j^t, d_j^{t'})$, e.g. added, removed, updated records
  4. At time $t$:
       $\langle y_i^{t}, c_i^{t} \rangle = \mathit{exec}(P, x_i^{t}, d^{t})$
     At time $t' > t$, after a change in dependency $d_j^t \rightarrow d_j^{t'}$:
       $\langle y_i^{t'}, c_i^{t'} \rangle = \mathit{exec}(P, x_i^{t}, d^{t'})$, where $d^{t'} = \{ d_1^t \dots d_j^{t'} \dots d_m^t \}$
     Impact of the change $d_j^t \rightarrow d_j^{t'}$:
       $\mathit{imp}(d_j^t \rightarrow d_j^{t'}, y_i^t) = f_Y(\mathit{diff}_Y(y_i^t, y_i^{t'})) \in [0,1]$
     where the function $f_Y()$ is type-specific (and domain-specific).
  5. $\langle y_i^{t}, c_i^{t} \rangle = \mathit{exec}(P, x_i^{t}, d^{t})$, with $P$ = SVI and one execution per patient, $i: 1 \dots N$.
     - $x_i = \langle \mathit{varset}_i, \mathit{ph}_i \rangle$, where $\mathit{varset}_i$ are the patient's variants and $\mathit{ph}_i = \{ dt_1, dt_2, \dots \}$ is the patient's phenotype expressed as disease terms, e.g. “congenital myasthenic syndrome”
     - SVI is a classifier: $y = \{ (v, \mathit{class}) \mid v \in \mathit{varset}, \mathit{class} \in \{\textsf{red}, \textsf{amber}, \textsf{green}\} \}$
     - Reference datasets: $D = \{\mathit{OM}, \mathit{CV}\}$, with $\mathit{OM}$ = OMIM GeneMap and $\mathit{CV}$ = ClinVar: $\mathit{OM} = \{ \langle dt, \mathit{genes}(dt) \rangle \}$ and $\mathit{CV} = \{ \langle v, g, \mathit{varst}(v) \rangle \}$, where $\mathit{varst}(v) \in \{\textsf{unknown}, \textsf{benign}, \textsf{pathogenic}\}$
  6. $\mathit{diff}_{OM}$ returns the disease-term-to-gene mappings that have changed between the two versions (including possibly new mappings):
       $\mathit{diff}_{OM}(\mathit{OM}^t, \mathit{OM}^{t'}) = \{ \langle dt, \mathit{genes}(dt) \rangle \mid \mathit{genes}(dt) \neq \mathit{genes}'(dt) \}$
     where $\mathit{genes}'(dt)$ is the new mapping for $dt$ in $\mathit{OM}^{t'}$.
       $\mathit{diff}_{CV}(\mathit{CV}^t, \mathit{CV}^{t'}) = \{ \langle v, \mathit{varst}(v) \rangle \mid \mathit{varst}(v) \neq \mathit{varst}'(v) \} \cup (\mathit{CV}^{t'} \setminus \mathit{CV}^t) \cup (\mathit{CV}^t \setminus \mathit{CV}^{t'})$
     where $\mathit{varst}'(v)$ is the new class associated with $v$ in $\mathit{CV}^{t'}$.
  7. Let $O^t = \{ y_1^t, \dots, y_N^t \}$ be the set of all outcomes that are current at time $t$, and consider (for simplicity) a single change $d_j^t \rightarrow d_j^{t'}$.
     Select the optimal subset $O_{rc}^t \subseteq O^t$ such that
       $\max_{O_{rc}^t \subseteq O^t} \sum_{y_i^t \in O_{rc}^t} \mathit{imp}(d_j^t \rightarrow d_j^{t'}, y_i^t)$, subject to $\sum_{y_i^t \in O_{rc}^t} c_i^{t'} \leq C$
     Since true impact and cost are unknown before re-computation, use estimates
       $\{ \langle \widehat{\mathit{imp}}(d_j^t \rightarrow d_j^{t'}, y_i^t), \hat{c}_i^{t'} \rangle \mid y_i^t \in O^t \}$
     and solve
       $\max_{O_{rc}^t \subseteq O^t} \sum_{y_i^t \in O_{rc}^t} \widehat{\mathit{imp}}(d_j^t \rightarrow d_j^{t'}, y_i^t)$, subject to $\sum_{y_i^t \in O_{rc}^t} \hat{c}_i^{t'} \leq C$
  8. Starting from $\langle y_i^t, c_i^t \rangle = \mathit{exec}(P, x_i^t, \{ d_1^t \dots d_m^t \})$, write (for simplicity) $y = f(x)$ for the process $P$.
     Given a change $x \rightarrow x'$, we want $\mathit{diff}_Y(f(x), f(x'))$.
     Learn $f'()$ such that $f'(x) = f(x) + \epsilon$; then approximate
       $\mathit{diff}_Y(f(x), f(x')) \approx \mathit{diff}_Y(f'(x), f'(x'))$
  9. $\langle y_i^{t}, c_i^{t} \rangle = \mathit{exec}(P, x_i^{t}, d^{t})$, with input changes $x_i^t \rightarrow x_i^{t'}$ and dependency changes $d_j^t \rightarrow d_j^{t'}$.
  10. Given the dependency diff $\mathit{diff}_D(D_i^v, D_i^{v'})$, an output $y^v$ is affected if some changed record $d_{ij} \in \mathit{diff}_D(D_i^v, D_i^{v'})$ appears in its provenance, i.e. $\texttt{used}(P_j, d_{ij}, [\texttt{prov:role} = \texttt{'dep'}]) \in \mathit{prov}(y^v)$.
  11. analyse(W) runs analytics on a windowed stream $W_1, W_2, \dots$; at time $t_i$ it produces and delivers output $O_i = \mathit{analyse}(W_i)$. (Requires the ReComp preamble.)
        $\hat{y}_i = \begin{cases} y_i & \text{if } y_i = P(W_i) \text{ is computed} \\ y_{i-k} & \text{otherwise, where } y_{i-k} \text{ is the latest computed value} \end{cases}$
      Denote this value as $\mathit{surr}(y_i)$.
  12. $\mathit{diff}_O: O \times O \rightarrow [0,1]$, with $\mathit{diff}_O(y_i, y_i) = 0$.
      Quality function $q: O \times \mathbb{N} \rightarrow [0, q_{max}]$: $q(y_i, j)$ quantifies the currency of $y_i$ at time $j > i$.
        $q(y_i, j) = \begin{cases} q_{max} & \text{when } i = j \\ q_{max} - |\mathit{diff}_O(y_i, y_j)| & \text{otherwise} \end{cases}$
        $\mathit{perf}(N) = \dfrac{\sum_{i=1}^{N} q(\mathit{surr}(y_i), i)}{N \cdot C}$
  13. Very conservative case (compute every value): $C = N \cdot c$ and $\sum_{i=1}^{N} q(\mathit{act}(y_i), i) = N \cdot q_{max}$, because $\mathit{act}(y_i) = \mathit{out}_i$ for all $i$; thus $\mathit{perf}(N) = \dfrac{q_{max}}{N \cdot c}$.
      Very optimistic case (compute only the first value): $q(\mathit{act}(y_i), i) = q(y_1, i) = q_{max} - \mathit{diff}_O(y_1, y_i)$ for each $i$; thus $\mathit{perf}(N) = \dfrac{1}{c}\Bigl[\, q_{max} - \dfrac{\sum_{i=1}^{N} \mathit{diff}_O(y_1, y_i)}{N} \Bigr]$.
  14. We used the generalised Kendall’s tau as a well recognised, generic method to compare top-k lists.