C4Bio paper talk
From scripted HPC-based NGS pipelines to
workflows on the cloud
Jacek Cała, Yaobo Xu, Eldarina Azfar Wijaya, Paolo Missier
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
C4Bio workshop @CCGrid 2014
Chicago, May 26th, 2014
C4Bio 2014 @ CCGrid - P. Missier
The Cloud-e-Genome project
NGS data processing:
provide mechanisms to rapidly and flexibly create new exome sequence
data processing pipelines, and to deploy them in a scalable way
Cost / Scalability / Flexibility
Data to insight - Human variant interpretation for clinical diagnosis:
provide clinicians with a tool for the analysis and interpretation of
human variants
• 2 year pilot project
• Funded by UK’s National Institute for Health Research (NIHR)
through the Biomedical Research Centre (BRC)
• Nov. 2013: Cloud resources from Azure for Research Award
• 1 year’s worth of data/network/computing resources
Challenge:
to deliver the benefits of WES/WGS technology to clinical practice
Key technical goals
• Scalability
• In the rate and number of patient sequence submissions
• In the density of sequence data (from whole exome to whole genome)
• Flexibility, Traceability, Comparability across versions
• Simplify experimenting with alternative pipelines (choice of tools, configuration
parameters)
• Trace each version and its executions
• Ability to compare results obtained using different pipelines and reason about
the differences
• Openness. Simplify the process of adding:
• New variant analysis tools
• New statistical methods for variant filtering, selection, and ranking
• Integration with third party databases
Approach and testbed
Technical Approach:
• Double porting:
• Infrastructure: HPC cluster to cloud (IaaS)
• Implementation: NGS pipelines from scripts to workflows
• Implement user tools for clinical diagnosis as cloud apps (SaaS)
Testbed and scale:
• Neurological patients from the North-East of England, focus on rare diseases
• Initial testing on about 300 sequences
• 2500-3000 sequences expected within 12 months
Why port to workflow?
• Programming:
• Workflows provide better abstraction in the specification of pipelines
• Workflows directly executable by enactment engine
• Easier to understand, share, and maintain over time
• Flexible – relatively easy to introduce variations
• System: minimal installation/deployment requirements
• Fewer dedicated technical staff hours required
• Automated dependency management, packaging, deployment
• Extensible by wrapping new tools
• Exploits available data parallelism (but not automagically)
• Reproducibility
• Execution monitoring, provenance collection
• Persistent trace serves as evidence for the data
• Amenable to automated analysis
Scripted pipeline
• Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner
• Cleaning, duplicate elimination (Picard tools)
• Recalibration: corrects for systematic bias on quality scores assigned by the sequencer
• Coverage: computes the coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce the false-positive rate from the caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annotation: Annovar functional annotations (e.g. MAF, synonymy, SNPs…) followed by in-house annotations
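The stage list above can be read as a linear driver. A minimal sketch, where each stage function is a hypothetical placeholder standing in for the real tool invocation (BWA, Picard, GATK, Annovar), not the project's actual scripts:

```python
# Sketch of the scripted pipeline as an ordered list of stages.
# Each function is a placeholder that just records the step it applied.

def align(data):                return data + ["bwa-align"]
def clean(data):                return data + ["picard-dedup"]
def recalibrate(data):          return data + ["recalibrate"]
def coverage(data):             return data + ["coverage"]
def call_variants(data):        return data + ["haplotype-caller"]
def recalibrate_variants(data): return data + ["variant-recal"]
def subset_vcf(data):           return data + ["vcf-subset"]
def annotate(data):             return data + ["annovar"]

PIPELINE = [align, clean, recalibrate, coverage,
            call_variants, recalibrate_variants, subset_vcf, annotate]

def run_pipeline(sample):
    # Run every stage in order, threading the intermediate data through.
    data = [sample]
    for stage in PIPELINE:
        data = stage(data)
    return data

result = run_pipeline("sample-001")
```

The point of the sketch is the shape: a fixed, ordered chain with no versioning, provenance, or parallelism, which is exactly what the workflow port addresses.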
Pipeline evolution
Pipeline: a set C = { c1 … cn } of components (tool wrappers).
Each ci has a configuration conf(ci) and a version v(ci).
What can change:
1 – Tool version: v(ci) → v'(ci)
2 – Tool replacement / addition / removal: ci → c'i
3 – Configuration parameters: conf(ci) → conf'(ci)
…and why:
• Technology / algorithm evolution
• E.g. traditional GATK variant caller → GATK haplotype caller
• Does the interface change?
• Do the operational assumptions change?
• E.g. the GATK Variant Recalibrator requires large input data, so it is not suitable for targeted sequencing
(*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J.
Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.”
Briefings in bioinformatics, pp. bbs086–, Jan. 2013
For sequence alignment alone, Pabinger et al. in their survey (*) list 17 aligners,
while for variant annotation they reference over 70 tools
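The three kinds of change can be detected mechanically from the (version, configuration) records. A minimal sketch; the component names, versions, and parameters below are illustrative, not the project's actual pipeline:

```python
# Sketch: classifying differences between two pipeline versions into the
# three kinds of change from the slide (tool version, tool replacement/
# addition/removal, configuration parameters).

def diff_pipelines(p1, p2):
    """p1, p2: dicts mapping component name -> (version, conf dict)."""
    changes = []
    for name in sorted(set(p1) | set(p2)):
        if name not in p2:
            changes.append(("removed", name))
        elif name not in p1:
            changes.append(("added", name))
        else:
            v1, c1 = p1[name]
            v2, c2 = p2[name]
            if v1 != v2:
                changes.append(("version", name, v1, v2))
            if c1 != c2:
                changes.append(("config", name, c1, c2))
    return changes

# Illustrative pipeline versions: a BWA version bump and an Annovar
# configuration change.
v1 = {"bwa": ("0.5.9", {"threads": 2}),
      "annovar": ("2013", {"filters": "exonic"})}
v2 = {"bwa": ("0.7.5", {"threads": 2}),
      "annovar": ("2013", {"filters": "all"})}
changes = diff_pipelines(v1, v2)
```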
Role of provenance
Provenance refers to the sources of information, including entities
and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
• Provenance is evidence in support of clinical diagnosis
1. Why do these variants appear in the output list?
2. Why have you concluded they are disease-causing?
• Requires ability to trace variants through workflow execution
• Simple scripting lacks this functionality
“Where do these variants come from?”
“Why do these results differ?”
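Answering "where do these variants come from?" amounts to a backward traversal of the recorded derivation graph. A toy sketch, with an invented trace (the identifiers are illustrative, not e-Science Central's provenance model):

```python
# Sketch: backward traversal of a provenance trace to find every upstream
# artifact that contributed to a variant. Edges map each data item to the
# items it was derived from (a toy, single-sample trace).

derived_from = {
    "variant:v1": ["vcf:chunk3"],
    "vcf:chunk3": ["bam:recal"],
    "bam:recal": ["bam:dedup"],
    "bam:dedup": ["bam:aligned"],
    "bam:aligned": ["fastq:sample-001"],
}

def lineage(item, graph):
    """Return all upstream items that contributed to `item`."""
    seen, stack = [], [item]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

trace = lineage("variant:v1", derived_from)
```

This is the capability that simple scripting lacks: without the recorded edges there is nothing to traverse.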
Comparing results across pipeline configurations
Run pipeline version V1 → variant list VL1
Run pipeline version V2 → variant list VL2
V1 → V2:
• Replace BWA version
• Modify Annovar configuration parameters
How do VL1 and VL2 differ, and why?
• DDIFF (data differencing) compares the two variant lists
• PDIFF (provenance differencing) compares their provenance traces
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing
for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013):
doi:10.1002/cpe.3035.
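The DDIFF side can be sketched as a plain set comparison of the two variant lists, keyed here by (chromosome, position, ref, alt); the variants below are invented for illustration:

```python
# Sketch of data differencing (DDIFF) over two variant lists produced by
# pipeline versions V1 and V2. Variants are keyed by (chrom, pos, ref, alt).

def ddiff(vl1, vl2):
    s1, s2 = set(vl1), set(vl2)
    return {"only_in_v1": s1 - s2,
            "only_in_v2": s2 - s1,
            "common": s1 & s2}

# Illustrative variant lists.
VL1 = [("chr1", 12345, "A", "G"), ("chr2", 67890, "C", "T")]
VL2 = [("chr1", 12345, "A", "G"), ("chr7", 11111, "G", "A")]
d = ddiff(VL1, VL2)
```

DDIFF tells you *what* differs; explaining *why* requires PDIFF over the provenance traces.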
The corresponding provenance traces
[Figure: two provenance traces, (i) Trace A from pipeline V1 and (ii) Trace B from V2. Services S0…S5 and data items d0, d1, d2 label the nodes; in Trace B, service S1 is replaced by S5 and S2 by its new version S2v2.]
Delta graph computed by PDIFF
[Figure: delta graph pairing corresponding nodes of the two traces, e.g. (S1, S5) - service replacement; (S2, S2v2) and (S, Sv2) - version changes; down to the diverging data pairs such as (d1, d1'), along the P0 and P1 branches of S2 and S4.]
PDIFF helps determine the impact of
variations in the pipeline
HPC Cluster configuration
16 compute nodes, 48/96 GB RAM and 250 GB disk each
19 TB usable storage space
Gigabit Ethernet
Shared resource for Institute-wide research
Submission script specifies node/core requirements
Computation waits until resources are available
Current config:
• BWA alignment: 2 cores
• GATK: 8 cores
The case for cloud in genome informatics (*)
(*) Stein, Lincoln D. “The Case for Cloud Computing in Genome Informatics.” Genome
Biology 11, no. 5 (January 2010): 207.
• Storage + computing resources co-located
in a cloud
• Privacy issues
• Public, private, or hybrid
• Fluctuating demand benefits from
elasticity
• Web-based access to clinicians simplifies
adoption
Workflow on Azure Cloud - configuration
[Architecture diagram:]
• <<Azure VM>>: Azure Blob store and e-SC db backend
• <<Azure VM>>: e-Science Central main server, exposing a Web UI, a REST API, and a JMS queue
• 3 x <<worker role>>: workflow engines with the e-SC blob store, running the top-level workflow and sub-workflows
• Clients: web browser and rich client app
• Data flows: workflow invocations, e-SC control data, workflow data
Test configuration: 3 nodes, 24 cores
Workflow and sub-workflows execution
[Architecture diagram as before; each Executable Block and each sub-workflow submits back to the e-SC queue.]
Workflow invocation executing on one engine (fragment)
Multi-sample processing
Sample list [S1…Sk] → top-level workflow → variant files [VCF1…VCFk]
Map semantics: push K new workflow invocations to the e-SC queue,
one per sample: BWA (S1), BWA (S2), …
[Architecture diagram as before.]
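The map semantics can be sketched with a shared queue and a pool of engines. Here `queue.Queue` stands in for the e-SC JMS queue, and the block name and worker logic are illustrative placeholders, not the actual e-Science Central implementation:

```python
# Sketch of "map semantics": for K input samples, push K sub-workflow
# invocations onto a shared queue; a pool of workflow engines drains it.

import queue
import threading

def submit_map(samples, q):
    """The map step: one sub-workflow invocation per sample."""
    for s in samples:
        q.put(("BWA", s))

def engine(q, results):
    """A workflow engine: consume invocations until the queue is empty."""
    while True:
        try:
            block, sample = q.get_nowait()
        except queue.Empty:
            return
        results.append(f"{block}({sample}).vcf")  # stand-in for running the block
        q.task_done()

q = queue.Queue()
results = []
submit_map(["S1", "S2", "S3"], q)
workers = [threading.Thread(target=engine, args=(q, results)) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because the engines only coordinate through the queue, adding worker roles scales out sample processing without changing the workflow definition.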
Sub-workflows enqueued recursively
• Exec block specifies threading; the OS maps threads to available cores
• Sub-workflow gets added to the queue
• Sub-workflow: one instance gets added to the queue for each input sample
Preliminary cost estimates
2 samples: 1 x 8 cores, 30 h @ £0.821/h = £12.3 / sample
6 samples: 3 x 8 cores, 47 h @ £2.5/h = £19 / sample
Cloud deployment makes cost easy to calculate
Trade-off:
• Better flexibility and scalability
• But loss of performance
Some tuning required; the cost model is based on node uptime:
• Better node utilization: larger sample batches
• Remove unnecessary wait time: make sub-workflows asynchronous
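The per-sample figures follow directly from node uptime: total hours times the hourly rate, divided by the batch size. A quick check of the slide's arithmetic (the slide rounds £19.6 down to £19):

```python
# Per-sample cost under the node-uptime cost model:
# (hours of uptime) * (rate per hour, all nodes) / (samples in the batch).

def cost_per_sample(hours, rate_per_hour, samples):
    return hours * rate_per_hour / samples

small = cost_per_sample(30, 0.821, 2)  # 2 samples on 1 x 8-core node
large = cost_per_sample(47, 2.5, 6)    # 6 samples on 3 x 8-core nodes
```

The model also explains the tuning advice: idle uptime is billed, so larger batches and asynchronous sub-workflows reduce the per-sample denominator and the wasted hours respectively.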
Summary
• Whole-exome sequence processing on a cloud infrastructure
• Windows Azure – project sponsor
• Tracking provenance as evidence and for change analysis
• Porting HPC scripted pipeline to workflow model and technology
• Scalability, Flexibility, Evolvability
Editor's notes
Implement a cloud-based, secure scalable, computing infrastructure that is capable of translating the potential benefits of high throughput sequencing into actual genetic diagnosis to health care professionals.
Azure: 10 L instances, 24 h a day / 30 TB/year / 10 GB of SQL Azure space / 30-100 TB
Coverage information translates into confidence on variant call
Recalibration:
quality score recalibration --
the machine produces colour coding for the 4 bases, along with a p-value indicating the highest-probability call; these are the Q scores
different platforms give different systematic bias on Q scores, also depending on the lane: each lane gives a different systematic bias. The point of recalibration is to correct for this type of bias
Traditional variant callers
Go through the whole genome to identify locations where a number of non-reference bases appear, to call SNPs
Gapped mapping to identify indels
Different algorithms to calculate SNP and indel likelihoods
GATK HaplotypeCaller
Haplotype-based calling
Calls SNPs and indels simultaneously by performing a local de-novo assembly
Same algorithm for SNP and indel likelihoods
Artifacts caused by large indels recovered by assembly
We have seen some examples of the look and feel of e-SC.
Now we briefly go over the architecture.
SaaS – Science as a Service
Config is a trade-off: the shared resource has limited room for scalability at peak times.
The execution model is currently synchronous.
Overall Azure/lampredi ratio is about 1.4 (the Azure solution is 40% slower than lampredi) -- based on the pipeline run up to GATK phase 3 (exclusive), i.e. BWA-Picard-GATK1-VariantCalling.
BWA-Picard-GATK1-VariantCalling takes over 90% of the total pipeline time, so the approximation should be OK.
Our input in compressed form is 13.85 GiB on average.