Initial steps towards a production platform for DNA sequence analysis on the grid

ISMB/ECCB conference – 18 July 2011

Barbera van Schaik, Angela Luyf, Michel de Vries,
Frank Baas, Antoine van Kampen and Silvia Olabarriaga

b.d.vanschaik@amc.uva.nl

Overview

Grid computing and workflow technology
Example: Virus discovery

Analysis of larger data sets
Example: Genome of the Netherlands

Challenges and summary

Sequencing, Moore’s law and personnel

Note: Acceleration. Only the slope is meaningful in this graph.

http://www.politigenomics.com/2009/02/the-scale-up.html

What are the options?
Local cluster
Desktop grid
Super computer
Hadoop cluster
GPU cluster
Cloud computing
(Inter)national Grid
DNA computing
National computing facilities

Each system has its own interface
Need to learn how they all work

Grids
Distributed resources

Computing
Data storage

Open protocols

It's all about sharing

Resources
Methods
Collaborations

Dutch grid (resources)

http://www.biggrid.nl/

People, resources and data flow

Sequence facility
Research laboratories
Bioinformatics NGS team
e-BioScience team
Grid
My role

Example: Virus discovery
Virus discovery unit
VIDISCA method
[Figure: sequence reads grouped by experiment (exp1, exp2, exp3, exp6) and the GenBank NR database]

Goal: Identify known and discover new viruses in samples
Michel de Vries et al (2011) PLoS ONE

BLAST analysis workflow

Input: sequence reads

Conversion step (sff to fasta)

BLAST

Output: BLAST results

Implementation of workflow components
Workflow description (XML)

Component 1 (XML)
In: sequences (sff)
Executable/script: sff2fasta.pl
Out: sequences (fasta)

Component 2 (XML)
In: sequences (fasta) X database (fasta)
Executable/script: BLAST
Out: blast result (txt)

Tristan Glatard (2008) Future generation computer systems
http://gwendia.i3s.unice.fr/doku.php?id=gwendia
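
The two components above can be sketched as a small data structure in which each component names its input ports, its output port, and the program it wraps. This is only an illustration in Python, not the XML workflow language used in the talk: the `Component` class, the port names, and both stand-in functions are hypothetical, and the stand-ins only derive file names instead of calling sff2fasta.pl or BLAST.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Component:
    """One workflow step: named input ports, one output port, one program."""
    name: str
    inputs: List[str]
    output: str
    run: Callable[..., str]


def sff_to_fasta(sff: str) -> str:
    # Hypothetical stand-in for sff2fasta.pl: only derives the output name.
    return sff.rsplit(".", 1)[0] + ".fasta"


def blast(fasta: str, database: str) -> str:
    # Hypothetical stand-in for BLAST: only derives the result file name.
    return f"{fasta}.vs.{database}.txt"


# Component 1 feeds component 2 through the "reads_fasta" port.
WORKFLOW = [
    Component("convert", ["reads_sff"], "reads_fasta", sff_to_fasta),
    Component("blast", ["reads_fasta", "database"], "blast_result", blast),
]


def run_workflow(workflow: List[Component], sources: Dict[str, str]) -> Dict[str, str]:
    """Run components in order, passing data between them by port name."""
    data = dict(sources)
    for component in workflow:
        args = [data[port] for port in component.inputs]
        data[component.output] = component.run(*args)
    return data


result = run_workflow(WORKFLOW, {"reads_sff": "sample1.sff", "database": "viruses"})
print(result["blast_result"])
```

Because the components only know their ports, the conversion step can be reused in another workflow without touching the BLAST step.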

Run workflow on the grid

Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology In Biomedicine
Tristan Glatard (2008) International Journal of High Performance Computing Applications

Graphical user interface: VBrowser

http://www.vl-e.nl/vbrowser

Speed up
[Figure: sequence reads grouped by experiment (exp1, exp2, exp3, exp6) as input to Blast]

2 databases: Human ribosomal, Viruses
15 experiments
722 samples

Total CPU time: 413 hrs (~17 days)
Elapsed time workflow: 13.7 hrs
= 30x speed up
Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
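
The speed-up quoted above is the ratio of total CPU time to elapsed workflow time. A short check of that arithmetic, using only the two figures from the slide (the function name is illustrative):

```python
def speed_up(total_cpu_hours: float, elapsed_hours: float) -> float:
    """Ratio of summed CPU time to wall-clock time of the workflow."""
    return total_cpu_hours / elapsed_hours


cpu_hours = 413.0   # total CPU time reported on the slide
elapsed = 13.7      # elapsed workflow time reported on the slide

ratio = speed_up(cpu_hours, elapsed)
cpu_days = cpu_hours / 24

print(round(ratio, 1))     # 30.1
print(round(cpu_days, 1))  # 17.2
```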

Benefits workflow technology

Agile development

Re-use of components

Iteration strategy

Knowledge about analysis
steps captured in workflow
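
An iteration strategy describes how the values arriving at a component's input ports are combined into individual tasks. A minimal sketch of two common strategies, assuming the "X" between sequences and database in the BLAST component denotes a cross product; the sample and database names below are made up for illustration:

```python
from itertools import product
from typing import List, Tuple


def cross(left: List[str], right: List[str]) -> List[Tuple[str, str]]:
    """Cross product: every item on the left is paired with every item on the right."""
    return list(product(left, right))


def dot(left: List[str], right: List[str]) -> List[Tuple[str, str]]:
    """Dot product: items are paired by position, so both lists must match in length."""
    if len(left) != len(right):
        raise ValueError("dot product needs inputs of equal length")
    return list(zip(left, right))


# Hypothetical inputs: three samples, two databases.
samples = ["s1.fasta", "s2.fasta", "s3.fasta"]
databases = ["human_ribosomal", "viruses"]

tasks = cross(samples, databases)
print(len(tasks))  # 6
```

With a cross product, each (sample, database) pair becomes an independent task, which is what allows the tasks to be spread over grid resources.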

Analysis of larger data sets
Genome of the Netherlands (GoNL)

Whole genome sequencing of 250 trios
Enrich biobanks
Reference set for disease studies

770 samples
45 TB raw data
Many partners (data sharing)
Analysis on distributed sites

http://www.bbmri.nl/
http://www.nlgenome.nl/

GoNL alignment pipeline
Pair1.fastq, Pair2.fastq, Reference genome
BWA aln, sampe, sam-to-bam, sort bam, index
Picard mark duplicates
GATK realignment
Picard fix mates
GATK recalibration
Result.bam

160 samples (478 lanes) are currently analyzed on the Dutch grid
Development and small tests: Nov 22, 2010 - now
Analysis: Mar 25, 2011 - Jul 15, 2011
Jobs: 13,981
Total CPU time: 5.5 years
Disk space used: 315 TB

Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)
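
The run statistics above can be turned into a few rough averages. This is back-of-the-envelope arithmetic only, assuming a 365-day year and that the analysis period covers the calendar days between the two dates on the slide; the variable names are illustrative.

```python
from datetime import date

# Figures taken from the slide.
jobs = 13_981
cpu_years = 5.5
samples = 160
lanes = 478
start, end = date(2011, 3, 25), date(2011, 7, 15)

# Assumption: one CPU year = 365 days of 24 hours.
cpu_hours = cpu_years * 365 * 24
hours_per_job = cpu_hours / jobs
lanes_per_sample = lanes / samples

# Assumption: the analysis ran over the whole calendar span.
analysis_days = (end - start).days
avg_busy_cores = cpu_hours / (analysis_days * 24)

print(f"average CPU hours per job: {hours_per_job:.2f}")
print(f"average lanes per sample: {lanes_per_sample:.2f}")
print(f"analysis period in days: {analysis_days}")
print(f"average number of busy cores: {avg_busy_cores:.1f}")
```

These are averages over the whole period; the slide does not say how the load was distributed over time.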

Challenges

• Error handling
• Data management
• Data protection
• Provenance tracking
• Transparent addition of other resources

Summary
More research and development needed in e-bioscience

Latest IT infrastructures needed for scaling up NGS data
analysis (grids, clouds, big clusters)

Workflow technology assists agile implementation of
bioinformatics software

Separate workflow development from IT infrastructure for
easier migration and expansion (middleware)

Acknowledgements
Genome of the Netherlands, NL
Cisca Wijmenga
Morris Swertz
All project partners

Virus discovery unit, AMC
Lia van der Hoek
Michel de Vries

Department of genome analysis, AMC
Frank Baas
Ted Bradley
Marja Jakobs

University of Amsterdam
Piter de Boer

BiG Grid
Jan Just Keijser
Tom Visser
Grid support

Modalis, France
Johan Montagnat

Creatis, France
Tristan Glatard

Bioinformatics Laboratory, AMC
Antoine van Kampen

NGS bioinformatics team
Aldo Jongejan
Marcel Willemsen

e-Bioscience team
Silvia Olabarriaga
Angela Luyf
Mark Santcroos
Shayan Shahand

http://www.bioinformaticslaboratory.nl/

BWA on grid – component description

BWA on grid – workflow description

e-BioInfra gateway
http://orange.ebioscience.amc.nl/ebioinfragateway/
No grid certificate needed
Data upload via sFTP (intranet)
Synced with grid storage
Workflows are started from web page

Implemented workflow components
for next generation sequencing

Existing software
• BLAST
• BLAT
• BWA
• Annovar
• Varscan
• Newbler
• FastQC
• Roche software
• GATK
• Picard
• Samtools

In-house software
• Data format converters
• Quality trimming
• Alternative splice product detection
• CDR3 detection (T- and B-cell variation)
• Genome comparison (small genomes)
