GenomeInformatics2013–P.Missier
From scripted HPC-based NGS pipelines to
workflows on the cloud
Jacek Cała, Yaobo Xu, Eldarina Azfar Wijaya, Paolo Missier
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
C4Bio workshop @CCGrid 2014
Chicago, May 26th, 2014
C4Bio2014@CCGrid,-P.Missier
The Cloud-e-Genome project
NGS data processing:
provide mechanisms to rapidly and flexibly
create new exome sequence data processing
pipelines, and to deploy them in a scalable way;
Cost
Scalability
Flexibility
Data to insight
Human variant interpretation for clinical
diagnosis:
provide clinicians with a tool for analysis
and interpretation of human variants
• 2 year pilot project
• Funded by UK’s National Institute for Health Research (NIHR)
through the Biomedical Research Centre (BRC)
• Nov. 2013: Cloud resources from Azure for Research Award
• 1 year’s worth of data/network/computing resources
Challenge:
to deliver the benefits of WES/WGS technology to clinical practice
Key technical goals
• Scalability
• In the rate and number of patient sequence submissions
• In the density of sequence data (from whole exome to whole genome)
• Flexibility, Traceability, Comparability across versions
• Simplify experimenting with alternative pipelines (choice of tools, configuration
parameters)
• Trace each version and its executions
• Ability to compare results obtained using different pipelines and reason about
the differences
• Openness. Simplify the process of adding:
• New variant analysis tools
• New statistical methods for variant filtering, selection, and ranking
• Integration with third party databases
Approach and testbed
Technical Approach:
• Double porting:
• Infrastructure: HPC cluster to cloud (IaaS)
• Implementation: NGS pipelines from scripts to workflow
• Implement user tools for clinical diagnosis as cloud apps (SaaS)
Testbed and scale:
• Neurological patients from the North-East of England, focus on rare diseases
• Initial testing on about 300 sequences
• 2500-3000 sequences expected within 12 months
Why port to workflow?
• Programming:
• Workflows provide better abstraction in the specification of pipelines
• Workflows directly executable by enactment engine
• Easier to understand, share, and maintain over time
• Flexible – relatively easy to introduce variations
• System: minimal installation/deployment requirements
• Fewer dedicated technical staff hours required
• Automated dependency management, packaging, deployment
• Extensible by wrapping new tools
• Exploits available data parallelism (but not automagically)
• Reproducibility
• Execution monitoring, provenance collection
• Persistent trace serves as evidence for data
• Amenable to automated analysis
Scripted pipeline
[Diagram: stages of the scripted pipeline]
• Alignment: aligns sample sequence to HG19 reference genome using BWA aligner
• Cleaning, duplicate elimination: Picard tools
• Recalibration: corrects for system bias on quality scores assigned by sequencer
• Coverage: computes coverage of each read
• Variant calling: operates on multiple samples simultaneously. Splits samples into chunks. Haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce false positive rate from caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annovar functional annotations (e.g. MAF, synonymity, SNPs…), followed by in-house annotations
From scripts to workflows
Workflow nesting
Pipeline evolution
Pipeline:
set C = { c1 … cn } of components -- tool wrappers
Each ci has a configuration conf(ci) and a version v(ci)
What can change
1 – Tool version:
v(ci) → v'(ci)
2 – Tool replacement / add / remove:
ci → c'i
3 – Configuration parameters:
conf(ci) → conf'(ci)
…and why
• Technology / algorithm evolution
• Traditional GATK variant caller → GATK haplotype caller
• Does the interface change?
• Do the operational assumptions change?
E.g. GATK Variant Recalibrator requires large input data. Not suitable for targeted sequencing
Just for sequence alignment, Pabinger et al. in their survey (*) list 17 aligners, while for variant annotation they refer to over 70 tools
(*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.” Briefings in Bioinformatics, pp. bbs086–, Jan. 2013
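The pipeline model on this slide (components ci with a version v(ci) and a configuration conf(ci)) can be sketched in a few lines of Python. This is only an illustration: the class and function names, the version strings, and the `maf_threshold` parameter are hypothetical and do not come from the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One tool wrapper c_i with its version v(c_i) and configuration conf(c_i)."""
    name: str
    version: str
    conf: tuple = ()  # (parameter, value) pairs; all values here are made up


def pipeline_changes(old, new):
    """Classify the differences between two pipelines (lists of Component)."""
    old_by_name = {c.name: c for c in old}
    new_by_name = {c.name: c for c in new}
    changes = []
    for name in sorted(old_by_name.keys() | new_by_name.keys()):
        before, after = old_by_name.get(name), new_by_name.get(name)
        if before is None:
            changes.append(("added", name))
        elif after is None:
            changes.append(("removed", name))
        else:
            if before.version != after.version:
                changes.append(("version", name))
            if before.conf != after.conf:
                changes.append(("configuration", name))
    return changes


# Hypothetical pipeline versions V1 and V2 (illustrative version numbers).
v1 = [
    Component("bwa", "1"),
    Component("gatk-variant-caller", "1"),
    Component("annovar", "1", (("maf_threshold", "0.01"),)),
]
v2 = [
    Component("bwa", "2"),
    Component("gatk-haplotype-caller", "1"),
    Component("annovar", "1", (("maf_threshold", "0.05"),)),
]

for change in pipeline_changes(v1, v2):
    print(change)
# ('configuration', 'annovar')
# ('version', 'bwa')
# ('added', 'gatk-haplotype-caller')
# ('removed', 'gatk-variant-caller')
```

In this sketch a tool replacement (ci → c'i) shows up as one "removed" plus one "added" entry; pairing the two would need extra information about which role each tool plays in the pipeline.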
Role of provenance
Provenance refers to the sources of information, including entities
and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
• Provenance is evidence in support of clinical diagnosis
1. Why do these variants appear in the output list?
2. Why have you concluded they are disease-causing?
• Requires ability to trace variants through workflow execution
• Simple scripting lacks this functionality
“Where do these variants come from?”
“Why do these results differ?”
Comparing results across pipeline configurations
Run pipeline version V1 → Variant list VL1
V1 → V2:
• Replace BWA version
• Modify Annovar configuration parameters
Run pipeline version V2 → Variant list VL2
VL1 vs VL2 ??
• DDIFF (data differencing)
• PDIFF (provenance differencing)
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013): doi:10.1002/cpe.3035.
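As a rough illustration of data differencing between two variant lists, the sketch below treats each variant as a (chromosome, position, ref, alt) tuple and compares VL1 and VL2 as sets. This is a simple stand-in, not the DDIFF of the cited paper; the function name and the variant values are made up for the example.

```python
def ddiff(vl1, vl2):
    """Set-based comparison of two variant lists (a simplification of data differencing)."""
    a, b = set(vl1), set(vl2)
    return {
        "only_in_vl1": sorted(a - b),
        "only_in_vl2": sorted(b - a),
        "in_both": sorted(a & b),
    }


# Hypothetical variant lists produced by pipeline versions V1 and V2.
vl1 = [("chr1", 12345, "A", "G"), ("chr2", 500, "C", "T"), ("chr7", 9000, "G", "GA")]
vl2 = [("chr1", 12345, "A", "G"), ("chr7", 9000, "G", "GA"), ("chrX", 42, "T", "C")]

result = ddiff(vl1, vl2)
print(result["only_in_vl1"])  # [('chr2', 500, 'C', 'T')]
print(result["only_in_vl2"])  # [('chrX', 42, 'T', 'C')]
```

A comparison like this says which variants differ, but not why; explaining the difference is the role of provenance differencing.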
PDIFF - overview
[Figure: two workflow versions, WA and WB]
The corresponding provenance traces
[Figure: (i) Trace A and (ii) Trace B. Provenance graphs over data items (d0, d1, d2, h, k, w, x, y, z, with primed counterparts such as d1', h', k', w', x', y', z' in Trace B), service invocations (S0 to S5, S0', S3', S2v2, S, Sv2) and ports P0, P1]
Delta graph computed by PDIFF
[Figure: delta graph pairing corresponding nodes of the two traces along the P0 and P1 branches of S2 and S4. Data pairs: (x, x'), (y, y'), (z, z'), (w, w'), (k, k'), (h, h'), (d1, d1'). Service pairs: (S1, S5) service replacement; (S2, S2v2) version change; (S, Sv2) version change]
PDIFF helps determine the impact of
variations in the pipeline
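The idea of walking two provenance traces upstream from their differing outputs can be sketched as follows. This is a heavily simplified illustration, not the PDIFF algorithm of the cited paper: it assumes both traces have the same shape with inputs listed in matching order, and the trace contents, node layout and labels are hypothetical.

```python
def node(kind, value, inputs=()):
    """A trace node: kind is 'data' or 'service'; inputs are upstream node ids."""
    return {"kind": kind, "value": value, "inputs": list(inputs)}


def pdiff(trace_a, trace_b, id_a, id_b):
    """Walk both traces upstream in lockstep and collect the node pairs that differ."""
    a, b = trace_a[id_a], trace_b[id_b]
    if a["kind"] != b["kind"]:
        return [("structure differs", id_a, id_b)]
    if a["kind"] == "data":
        if a["value"] == b["value"]:
            return []  # identical data: no need to look further upstream
        delta = [("data differs", id_a, id_b)]
    else:
        (name_a, version_a), (name_b, version_b) = a["value"], b["value"]
        if name_a != name_b:
            return [("service replacement", id_a, id_b)]
        delta = [("version change", id_a, id_b)] if version_a != version_b else []
    # Assumption: corresponding inputs appear in the same order in both traces.
    for in_a, in_b in zip(a["inputs"], b["inputs"]):
        delta += pdiff(trace_a, trace_b, in_a, in_b)
    return delta


# Toy traces: the same two-step pipeline, with service S2 upgraded in trace B.
trace_a = {
    "d1": node("data", "reads"),
    "S1": node("service", ("S1", "v1"), ["d1"]),
    "x": node("data", "aligned", ["S1"]),
    "S2": node("service", ("S2", "v1"), ["x"]),
    "y": node("data", "variants-A", ["S2"]),
}
trace_b = {
    "d1": node("data", "reads"),
    "S1": node("service", ("S1", "v1"), ["d1"]),
    "x": node("data", "aligned", ["S1"]),
    "S2v2": node("service", ("S2", "v2"), ["x"]),
    "y'": node("data", "variants-B", ["S2v2"]),
}

print(pdiff(trace_a, trace_b, "y", "y'"))
# [('data differs', 'y', "y'"), ('version change', 'S2', 'S2v2')]
```

The traversal stops at x because both traces hold the same intermediate data there, so the difference in the output is attributed to the version change of S2.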
HPC Cluster configuration
16 compute nodes
48/96GB RAM / 250GB disk
19TB usable storage space
Gigabit Ethernet
Shared resource for Institute-wide research
Submission script specifies node
/ core requirements
Computation waits until
resources are available
Current config:
• BWA alignment: 2 cores
• GATK: 8 cores
The case for cloud in genome informatics (*)
(*) Stein, Lincoln D. “The Case for Cloud Computing in Genome Informatics.” Genome
Biology 11, no. 5 (January 2010): 207.
• Storage + computing resources co-located
in a cloud
• Privacy issues
• Public, private, or hybrid
• Fluctuating demand → benefits from
elasticity
• Web-based access to clinicians simplifies
adoption
Workflow on Azure Cloud - configuration
[Diagram: e-Science Central (e-SC) deployment on Azure]
• Clients: web browser and rich client app, through the Web UI and REST API
• <<Azure VM>> e-Science Central main server, with JMS queue
• <<Azure VM>> e-SC db backend
• Azure Blob store: e-SC blob store
• <<worker role>> Workflow engines (three shown), running the top level workflow and sub-workflows
• Flows between components: workflow invocations, e-SC control data, workflow data
Test configuration:
3 nodes, 24 cores
Workflow and sub-workflows execution
[Diagram: workflow invocation executing on one engine (fragment). Executable blocks run on the engine; three blocks in the fragment send invocations to the e-SC queue. The e-Science Central deployment diagram (main server with JMS queue, e-SC db, e-SC blob store, worker-role workflow engines) is repeated alongside]
Multi-sample processing
Sample list [S1…Sk] → Top level workflow → Variant files [VCF1…VCFk]
Map semantics:
push K new workflow invocations to the e-SC queue
[Diagram: the e-Science Central deployment, with the queued invocations BWA (S1), BWA (S2), … picked up by the worker-role workflow engines]
Sub-workflows enqueued recursively
• Exec block: specifies threading → OS maps threads to available cores
• Sub-workflow: gets added to queue
• Sub-workflow: one instance gets added to queue for each input sample
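The map semantics described on these slides (one queued invocation per input sample, picked up by whichever engine is free) can be illustrated with Python's standard `queue` and `threading` modules. This is a local stand-in for the e-SC queue and workflow engines, not their actual API; all function names are hypothetical and the per-sample work is a placeholder.

```python
import queue
import threading


def run_sub_workflow(sample):
    """Placeholder for the per-sample sub-workflow (e.g. the alignment step)."""
    return f"VCF({sample})"


def engine(invocations, results):
    """One workflow engine: take invocations from the shared queue until told to stop."""
    while True:
        item = invocations.get()
        if item is None:  # sentinel: no more work for this engine
            invocations.task_done()
            break
        index, sample = item
        results[index] = run_sub_workflow(sample)
        invocations.task_done()


def top_level_workflow(samples, n_engines=3):
    """Map: push one invocation per sample to the queue, then wait for all results."""
    invocations = queue.Queue()
    results = [None] * len(samples)
    workers = [
        threading.Thread(target=engine, args=(invocations, results))
        for _ in range(n_engines)
    ]
    for worker in workers:
        worker.start()
    for item in enumerate(samples):  # K samples -> K queued invocations
        invocations.put(item)
    for _ in workers:
        invocations.put(None)
    invocations.join()  # the parent blocks here until every invocation is done
    for worker in workers:
        worker.join()
    return results


print(top_level_workflow(["S1", "S2", "S3", "S4"]))
# ['VCF(S1)', 'VCF(S2)', 'VCF(S3)', 'VCF(S4)']
```

In this sketch the parent waits on `invocations.join()` while its sub-workflows run, which is the synchronous behaviour; an asynchronous variant would release the parent instead of blocking.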
Preliminary cost estimates
2 samples
1 x 8 core
30 hr @ £0.821 / h = £12.3 / sample
6 samples
3 x 8 core
47 hr @ £2.5 / h = £19 / sample
Cloud deployment makes cost easy to calculate
Trade-off:
• Better flexibility, scalability
• But loss of performance
Some tuning required
Cost model based on node uptime
Better node utilization
• Larger sample batches
Remove unnecessary wait time
• Make sub-workflows async
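The per-sample figures above follow from a cost model based on node uptime: hours × hourly rate ÷ number of samples. The short calculation below reproduces them. It assumes, as a guess not stated on the slide, that the £2.5 / h figure is three times the single-node rate of £0.821 / h, rounded.

```python
def cost_per_sample(hours, hourly_rate, samples):
    """Cost model based on node uptime: total bill divided by the batch size."""
    return hours * hourly_rate / samples


NODE_RATE = 0.821  # £ per hour for one 8-core node (from the slide)

small_batch = cost_per_sample(30, NODE_RATE, 2)           # 1 node, 2 samples
large_batch_rounded = cost_per_sample(47, 2.5, 6)         # rate as shown on the slide
large_batch_exact = cost_per_sample(47, 3 * NODE_RATE, 6)  # assumed 3 x single-node rate

print(f"{small_batch:.1f}")          # 12.3
print(f"{large_batch_rounded:.1f}")  # 19.6
print(f"{large_batch_exact:.1f}")    # 19.3

# Node-hours consumed per sample in each configuration.
print(30 * 1 / 2)  # 15.0
print(47 * 3 / 6)  # 23.5
```

Under these assumptions the larger deployment uses more node-hours per sample (23.5 against 15.0), which is consistent with the slide's note that node utilization and wait time need tuning.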
Summary
• Whole-exome sequence processing on a cloud infrastructure
• Windows Azure – project sponsor
• Tracking provenance as evidence and for change analysis
• Porting HPC scripted pipeline to workflow model and technology
• Scalability, Flexibility, Evolvability

C4Bio paper talk

  • 1. GenomeInformatics2013–P.Missier From scripted HPC-based NGS pipelines to workflows on the cloud Jacek Cała, Yaobo Xu, Eldarina Azfar Wijaya, Paolo Missier School of Computing Science and Institute of Genetic Medicine Newcastle University, Newcastle upon Tyne, UK C4Bio workshop @CCGrid 2014 Chicago, May 26th, 2014
  • 2. C4Bio2014@CCGrid,-P.Missier The Cloud-e-Genome project. NGS data processing: provide mechanisms to rapidly and flexibly create new exome sequence data processing pipelines, and to deploy them in a scalable way. Cost, Scalability, Flexibility, Data to insight. Human variant interpretation for clinical diagnosis: provide clinicians with a tool for analysis and interpretation of human variants • 2 year pilot project • Funded by UK’s National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC) • Nov. 2013: Cloud resources from Azure for Research Award • 1 year’s worth of data/network/computing resources Challenge: to deliver the benefits of WES/WGS technology to clinical practice
  • 3. Key technical goals • Scalability • In the rate and number of patient sequence submissions • In the density of sequence data (from whole exome to whole genome) • Flexibility, Traceability, Comparability across versions • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters) • Trace each version and its executions • Ability to compare results obtained using different pipelines and reason about the differences • Openness. Simplify the process of adding: • New variant analysis tools • New statistical methods for variant filtering, selection, and ranking • Integration with third party databases
  • 4. Approach and testbed Technical Approach: • Double porting • Infrastructure: HPC cluster to cloud (IaaS) • Implementation: NGS pipelines from scripts to workflow • Implement user tools for clinical diagnosis as cloud apps (SaaS) Testbed and scale: • Neurological patients from the North-East of England, focus on rare diseases • Initial testing on about 300 sequences • 2500-3000 sequences expected within 12 months
  • 5. Why port to workflow? • Programming: • Workflows provide better abstraction in the specification of pipelines • Workflows directly executable by enactment engine • Easier to understand, share, and maintain over time • Flexible: relatively easy to introduce variations • System: minimal installation/deployment requirements • Fewer dedicated technical staff hours required • Automated dependency management, packaging, deployment • Extensible by wrapping new tools • Exploits available data parallelism (but not automagically) • Reproducibility • Execution monitoring, provenance collection • Persistent trace serves as evidence for data • Amenable to automated analysis
  • 6. Scripted pipeline. Recalibration: corrects for system bias on quality scores assigned by the sequencer. Computes coverage of each read. VCF subsetting by filtering, e.g. non-exomic variants. Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations. Aligns sample sequence to HG19 reference genome using BWA aligner. Cleaning, duplicate elimination: Picard tools. Variant calling operates on multiple samples simultaneously. Splits samples into chunks. Haplotype caller detects both SNVs and longer indels. Variant recalibration attempts to reduce false positive rate from caller.
  • 9. Pipeline evolution. Pipeline: set C = { c1 … cn } of components (tool wrappers). Each ci has a configuration conf(ci) and a version v(ci). What can change: 1. Tool version: v(ci) → v’(ci) 2. Tool replacement / add / remove: ci → c’i 3. Configuration parameters: conf(ci) → conf’(ci) …and why: • Technology / algorithm evolution • Traditional GATK variant caller → GATK haplotype caller • Does the interface change? • Do the operational assumptions change? E.g. GATK Variant Recalibrator requires large input data, so it is not suitable for targeted sequencing. Just for sequence alignment, Pabinger et al. in their survey (*) list 17 aligners, while for variant annotation they refer to over 70 tools. (*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.” Briefings in bioinformatics, pp. bbs086–, Jan. 2013
  • 10. Role of provenance. Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*). Provenance is a description of how things came to be, and how they came to be in the state they are in today (*). • Provenance is evidence in support of clinical diagnosis 1. Why do these variants appear in the output list? 2. Why have you concluded they are disease-causing? • Requires ability to trace variants through workflow execution • Simple scripting lacks this functionality “Where do these variants come from?” “Why do these results differ?”
  • 11. Comparing results across pipeline configurations. Run pipeline version V1 → variant list VL1. V1 → V2: replace BWA version, modify Annovar configuration parameters. Run pipeline version V2 → variant list VL2. Comparing VL1 with VL2: DDIFF (data differencing), PDIFF (provenance differencing). Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013): doi:10.1002/cpe.3035.
  • 13. The corresponding provenance traces [Figure: two provenance graphs, (i) Trace A and (ii) Trace B, over services S0 to S5 (including S2v2 and Sv2), with branches P0 and P1 and data items d0, d1, d2, h, k, w, x, y, z and their primed counterparts; the graph layout is not recoverable from the extracted text]
  • 14. Delta graph computed by PDIFF [Figure: delta graph pairing nodes of the two traces; annotations include S1, S5 (service replacement), S2, S2v2 (version change), S, Sv2 (version change), and the P0 and P1 branches of S2 and S4; the graph layout is not recoverable from the extracted text] PDIFF helps determine the impact of variations in the pipeline
  • 15. HPC Cluster configuration: 16 compute nodes, 48/96GB RAM / 250GB disk, 19TB usable storage space, Gigabit Ethernet. Shared resource for Institute-wide research. Submission script specifies node / core requirements. Computation waits until resources are available. Current config: • BWA alignment: 2 cores • GATK: 8 cores
  • 16. The case for cloud in genome informatics (*) • Storage + computing resources co-located in a cloud • Privacy issues • Public, private, or hybrid • Fluctuating demand → benefits from elasticity • Web-based access to clinicians simplifies adoption (*) Stein, Lincoln D. “The Case for Cloud Computing in Genome Informatics.” Genome Biology 11, no. 5 (January 2010): 207.
  • 17. Workflow on Azure Cloud - configuration [Figure: deployment diagram. Labels: Azure Blob store; e-SC db backend (<<Azure VM>>); e-Science Central main server (<<Azure VM>>) with JMS queue, REST API and Web UI; clients: web browser, rich client app; three workflow engines (<<worker role>>); e-SC blob store. Flows: workflow invocations, e-SC control data, workflow data. The workflow engines run the top level workflow and sub-workflows.] Test configuration: 3 nodes, 24 cores
  • 18. Workflow and sub-workflows execution [Figure: workflow invocation executing on one engine (fragment), with one block labelled "Executable Block" and three blocks labelled "To e-SC queue", shown against the deployment diagram of slide 17: e-SC db, e-Science Central main server (<<Azure VM>>) with JMS queue, REST API and Web UI, three workflow engines (<<worker role>>), e-SC blob store]
  • 19. Multi-sample processing. Sample list [S1…Sk] → top level workflow → variant files [VCF1…VCFk]. Map semantics: push K new workflow invocations to the e-SC queue. [Figure: deployment diagram of slide 17 (e-Science Central main server with JMS queue, REST API and Web UI, three workflow engines, e-SC blob store), with BWA (S1), BWA (S2), … running on the workflow engines]
  • 20. Sub-workflows enqueued recursively. Exec block: specifies threading → OS maps threads to available cores. Sub-workflow: gets added to queue. Sub-workflow: one instance gets added to queue for each input sample.
  • 21. Preliminary cost estimates. 2 samples, 1 x 8 core: 30 hr @ £0.821 / h = £12.3 / sample. 6 samples, 3 x 8 core: 47 hr @ £2.5 / h = £19 / sample. Cloud deployment makes cost easy to calculate. Trade-off: • Better flexibility, scalability • But loss of performance. Some tuning required. Cost model based on node uptime. Better node utilization: • Larger sample batches. Remove unnecessary wait time: • Make sub-workflows async
  • 22. Summary • Whole-exome sequence processing on a cloud infrastructure • Windows Azure: project sponsor • Tracking provenance as evidence and for change analysis • Porting HPC scripted pipeline to workflow model and technology • Scalability, Flexibility, Evolvability
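
Slide 9 models a pipeline as a set of components, each with a version v(ci) and a configuration conf(ci), and names three kinds of change. A minimal Python sketch of that model follows. It compares two pipeline specifications only; it is not the PDIFF or DDIFF algorithm from the cited paper. The class and function names, the version strings, and the `maf_cutoff` parameter are all hypothetical, and a tool replacement is reported here as one removal plus one addition.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One tool wrapper c_i with its version v(c_i) and configuration conf(c_i)."""
    name: str
    version: str
    conf: dict = field(default_factory=dict)


def diff_pipelines(old, new):
    """Classify the changes between two pipelines, matching components by name."""
    old_by_name = {c.name: c for c in old}
    new_by_name = {c.name: c for c in new}
    changes = []
    for name in sorted(old_by_name.keys() | new_by_name.keys()):
        before, after = old_by_name.get(name), new_by_name.get(name)
        if before is None:
            changes.append((name, "added"))
        elif after is None:
            changes.append((name, "removed"))
        else:
            if before.version != after.version:
                changes.append((name, "version"))
            if before.conf != after.conf:
                changes.append((name, "configuration"))
    return changes


# The V1 -> V2 change of slide 11: new BWA version, modified Annovar parameters.
# Version strings and the parameter name are placeholders, not taken from the talk.
v1 = [Component("BWA", "x.1"), Component("Annovar", "y.1", {"maf_cutoff": 0.01})]
v2 = [Component("BWA", "x.2"), Component("Annovar", "y.1", {"maf_cutoff": 0.05})]

for name, kind in diff_pipelines(v1, v2):
    print(f"{name}: {kind} changed")
```

A specification-level diff of this kind says what was changed between V1 and V2; explaining why the variant lists differ still needs the execution traces, which is the role the slides assign to provenance.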

Editor's notes

  1. Implement a cloud-based, secure, scalable computing infrastructure that is capable of translating the potential benefits of high throughput sequencing into actual genetic diagnosis for health care professionals. Azure: 10 L instances, 24h a day / 30 TB/year / 10 GB of SQL Azure space / 30-100 TB
  2. Coverage information translates into confidence in the variant call. Recalibration: quality score recalibration. The machine produces colour coding for the four bases, along with a p-value indicating the highest-probability call; these are the Q scores. Different platforms give different systematic bias on Q scores, and the bias also depends on the lane: each lane gives a different systematic bias. The point of recalibration is to correct for this type of bias.
  3. e-Science Central: integrates multiple runtime environments: R, Octave, Java, Javascript, (Perl)
  4. Traditional variant callers: go through the whole genome to identify locations where a number of non-reference bases appear, in order to call SNPs; use gapped mapping to identify indels; use different algorithms to calculate SNP and indel likelihoods. GATK HaplotypeCaller: haplotype-based calling; calls SNPs and indels simultaneously by performing a local de novo assembly; same algorithm for SNP and indel likelihoods; artifacts caused by large indels are recovered by assembly.
  5. We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service
  6. Config is a trade off – shared resource has limited room for scalability. Peak times
  7. Model currently is sync execution
  8. Overall Azure/lampredi ratio is about 1.4 (the Azure solution is 40% slower than lampredi). This is based on the pipeline run up to, but excluding, GATK-phase3, i.e. BWA-picard-GATK1-VariantCalling. BWA-picard-GATK1-VariantCalling takes over 90% of the total time of the pipeline, so the approximation should be ok. Our input in compressed form is 13.85 GiB on average.
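
The per-sample figures on slide 21 can be reproduced from the uptime-based cost model it mentions. The sketch below assumes that the quoted hourly rate covers the whole deployment and that cost is uptime times rate, divided by the number of samples in the batch; that reading, and the function name, are assumptions rather than something the slides state. Under it, the first configuration gives the slide's £12.3 and the second comes out between £19 and £20, so the slide's £19 appears to be rounded.

```python
def cost_per_sample(hours: float, rate_per_hour: float, samples: int) -> float:
    """Uptime-based cost: the deployment is billed for every hour it is up,
    and that bill is shared equally by the samples in the batch."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return hours * rate_per_hour / samples


# Figures from slide 21; rates are GBP per hour for the whole deployment (assumed).
single_node = cost_per_sample(hours=30, rate_per_hour=0.821, samples=2)
three_nodes = cost_per_sample(hours=47, rate_per_hour=2.5, samples=6)

print(f"1 x 8 core, 2 samples: £{single_node:.2f} per sample")
print(f"3 x 8 core, 6 samples: £{three_nodes:.2f} per sample")
```

Because the bill depends on uptime rather than on work done, idle time spent waiting on synchronous sub-workflows is paid for in full, which is why the slide lists larger batches and asynchronous sub-workflows as tuning options.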