Initial steps towards a production platform for DNA sequence analysis on the grid
1. Initial steps towards a production platform for DNA sequence analysis on the grid
ISMB/ECCB conference – 18 July 2011
Barbera van Schaik, Angela Luyf, Michel de Vries,
Frank Baas, Antoine van Kampen and Silvia Olabarriaga
b.d.vanschaik@amc.uva.nl
2. Overview
Grid computing and workflow technology
Example: Virus discovery
Analysis of larger data sets
Example: Genome of the Netherlands
Challenges and summary
3. Sequencing, Moore’s law and personnel
[Figure: sequencing, Moore's law and personnel; annotation: "Acceleration"]
Note: Only the slope is meaningful in this graph
http://www.politigenomics.com/2009/02/the-scale-up.html
4. What are the options?
Local cluster
Desktop grid
Super computer
Hadoop cluster
GPU cluster
Cloud computing
(Inter)national Grid
DNA computing
National computing facilities
Each system has its own interface
Need to learn how they all work
5. Grids
Distributed resources
Computing
Data storage
Open protocols
It's all about sharing
Resources
Methods
Collaborations
7. People, resources and data flow
[Diagram labels: Sequence facility; My role; Bioinformatics NGS team; e-BioScience team; grid; Research laboratories]
8. Example: Virus discovery
[Diagram labels: VIDISCA method; Virus discovery unit; experiments (exp1, exp2, exp3, exp6); GenBank - NR]
Goal: Identify known and discover new viruses in samples
Michel de Vries et al. (2011) PLoS ONE
11. Run workflow on the grid
Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology In Biomedicine
Tristan Glatard (2008) International Journal of High Performance Computing Applications
14. Speed up
[Diagram labels: experiments (exp1, exp2, exp3, exp6); Blast]
2 databases: Human ribosomal, Viruses
15 experiments
722 samples
Total CPU time: 413 hrs (~17 days)
Elapsed time workflow: 13.7 hrs
= 30x speed up
Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
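The speed-up figure can be checked with a short calculation. This is a minimal sketch that assumes speed-up is defined as total CPU time divided by elapsed workflow time, which is consistent with the numbers on the slide; the variable names are illustrative.

```python
# Numbers taken from the slide.
total_cpu_hours = 413.0   # summed CPU time of all jobs
elapsed_hours = 13.7      # wall-clock time of the whole workflow

# Assumed definition: speed-up = total CPU time / elapsed time.
speedup = total_cpu_hours / elapsed_hours
cpu_days = total_cpu_hours / 24

print(f"speed-up: {speedup:.1f}x, CPU time: {cpu_days:.1f} days")
# prints "speed-up: 30.1x, CPU time: 17.2 days"
```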
15. Benefits workflow technology
Agile development
Re-use of components
Iteration strategy
Knowledge about analysis steps captured in workflow
16. Analysis of larger data sets
Genome of the Netherlands (GoNL)
Whole genome sequencing of 250 trios
Enrich biobanks
Reference set for disease studies
770 samples
45 TB raw data
Many partners (data sharing)
Analysis on distributed sites
http://www.bbmri.nl/
http://www.nlgenome.nl/
17. GoNL alignment pipeline
Input: Pair1.fastq, Pair2.fastq, Reference genome
BWA aln, sampe, sam-to-bam, sort bam, index
Picard mark duplicates
GATK realignment
Picard fix mates
GATK recalibration
Output: Result.bam
160 samples (478 lanes) are currently analyzed on the Dutch grid
Development and small tests: Nov 22, 2010 - now
Analysis: Mar 25, 2011 - Jul 15, 2011
Jobs: 13,981
Total CPU time: 5.5 years
Disk space used: 315 TB
Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)
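The stage order of this pipeline can be written down as a simple plan in which each stage consumes the output of the previous one. The sketch below is only illustrative: the stage names follow the slide, but the function `plan_pipeline` and the intermediate file naming are assumptions, and no actual BWA, Picard or GATK commands are run.

```python
# Stage names follow the slide; everything else is a hypothetical convention.
STAGES = [
    "bwa_align",               # BWA aln, sampe, sam-to-bam, sort bam, index
    "picard_mark_duplicates",
    "gatk_realignment",
    "picard_fix_mates",
    "gatk_recalibration",
]

def plan_pipeline(sample):
    """Return (stage, input, output) triples for one sample, in order."""
    steps = []
    # Hypothetical naming for the paired-end input files.
    current = f"{sample}_pair1.fastq+{sample}_pair2.fastq"
    for i, stage in enumerate(STAGES, start=1):
        is_last = i == len(STAGES)
        output = f"{sample}.result.bam" if is_last else f"{sample}.step{i}.bam"
        steps.append((stage, current, output))
        current = output  # the next stage reads what this stage wrote
    return steps

for stage, src, dst in plan_pipeline("sample1"):
    print(f"{stage}: {src} -> {dst}")
```

Keeping the plan as data makes it easy to submit each stage as a separate grid job and to restart from the last completed intermediate file.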
18. Challenges
• Error handling
• Data management
• Data protection
• Provenance tracking
• Transparent addition of other resources
19. Summary
More research and development needed in e-bioscience
Latest IT infrastructures needed for scaling up NGS data analysis (grids, clouds, big clusters)
Workflow technology assists agile implementation of bioinformatics software
Separate workflow development from IT infrastructure for easier migration and expansion (middleware)
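One way to picture the separation of workflow and infrastructure is a small interface that the workflow talks to, with the infrastructure hidden behind it. The sketch below is an assumption for illustration only: `Backend`, `DryRunBackend` and `run_workflow` are hypothetical names, not part of the platform described in this talk, and the backend only records jobs instead of running them.

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Hypothetical interface hiding where a job actually runs."""

    @abstractmethod
    def submit(self, step, sample):
        """Submit one analysis step for one sample and return a job label."""

class DryRunBackend(Backend):
    """Records jobs instead of running them; stands in for a grid, cloud or cluster."""

    def __init__(self, name):
        self.name = name
        self.jobs = []

    def submit(self, step, sample):
        self.jobs.append((step, sample))
        return f"{self.name}:{step}:{sample}"

def run_workflow(backend, steps, samples):
    """The workflow only knows the Backend interface, not the infrastructure."""
    return [backend.submit(step, sample) for sample in samples for step in steps]

# The same workflow definition can be pointed at a different backend unchanged.
print(run_workflow(DryRunBackend("grid"), ["blast"], ["exp1", "exp2"]))
print(run_workflow(DryRunBackend("cluster"), ["blast"], ["exp1", "exp2"]))
```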
20. Acknowledgements
Genome of the Netherlands, NL: Cisca Wijmenga, Morris Swertz, All project partners
Virus discovery unit, AMC: Lia van der Hoek, Michel de Vries
Department of genome analysis, AMC: Frank Baas, Ted Bradley, Marja Jakobs
University of Amsterdam: Piter de Boer
BiG Grid: Jan Just Keijser, Tom Visser, Grid support
Modalis, France: Johan Montagnat
Creatis, France: Tristan Glatard
Bioinformatics Laboratory, AMC: Antoine van Kampen
NGS bioinformatics team: Aldo Jongejan, Marcel Willemsen
e-Bioscience team: Silvia Olabarriaga, Angela Luyf, Mark Santcroos, Shayan Shahand
http://www.bioinformaticslaboratory.nl/