Initial steps towards a production platform
   for DNA sequence analysis on the grid

           ISMB/ECCB conference – 18 July 2011

      Barbera van Schaik, Angela Luyf, Michel de Vries,
   Frank Baas, Antoine van Kampen and Silvia Olabarriaga

                b.d.vanschaik@amc.uva.nl
Overview

Grid computing and workflow technology
        Example: Virus discovery

     Analysis of larger data sets
 Example: Genome of the Netherlands

       Challenges and summary
Sequencing, Moore's law and personnel

[Figure: growth curves for sequencing, Moore's law and personnel; axis label: Acceleration. Note: only the slope is meaningful in this graph.]
Source: http://www.politigenomics.com/2009/02/the-scale-up.html
What are the options?
Local cluster
Desktop grid
Super computer
Hadoop cluster
GPU cluster
Cloud computing
(Inter)national Grid
DNA computing

Each system has its own interface
Need to learn how they all work
National computing facilities
Grids
    Distributed resources

             Computing
             Data storage


    Open protocols


    It's all about sharing

             Resources
             Methods
             Collaborations
Dutch grid (resources)

[Figure: Dutch grid resources]
http://www.biggrid.nl/
People, resources and data flow

[Diagram nodes: Sequence facility, Research laboratories, Bioinformatics NGS team, e-BioScience team, grid; label: My role]
Example: Virus discovery
VIDISCA method (Virus discovery unit)

[Diagram: multiple experiments (exp1, exp2, exp3, exp6) and GenBank - NR]

Goal: Identify known and discover new viruses in samples
Michel de Vries et al (2011) PLoS ONE
BLAST analysis workflow

    Input: sequence reads


 Conversion step (sff to fasta)


            BLAST


    Output: BLAST results
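
The two steps above can be sketched as a pair of command lines. This is a minimal sketch, not the implementation used in the talk: the argument order of sff2fasta.pl is hypothetical, and the BLAST call assumes the BLAST+ blastn program with its -query, -db and -out options; the slides do not say which BLAST program or version was run. The function only builds the commands and does not execute them.

```python
from pathlib import Path


def build_commands(sff_file, database):
    """Return the two workflow steps as argument lists (nothing is executed)."""
    sff = Path(sff_file)
    fasta = sff.with_suffix(".fasta")
    result = sff.with_name(sff.stem + ".blast.txt")

    # Step 1: conversion (sff to fasta).
    # ASSUMPTION: sff2fasta.pl takes input and output paths as positional arguments.
    convert = ["perl", "sff2fasta.pl", str(sff), str(fasta)]

    # Step 2: BLAST.
    # ASSUMPTION: nucleotide search with BLAST+ blastn; the talk does not name the program.
    blast = ["blastn", "-query", str(fasta), "-db", database, "-out", str(result)]

    return [convert, blast]


if __name__ == "__main__":
    for command in build_commands("sample1.sff", "viruses"):
        print(" ".join(command))
```

Keeping each step as an argument list makes it easy to hand the same description to a local subprocess call or to a job submission system.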
Implementation of workflow components

Workflow description (XML)

Component 1 (XML)
    In: sequences (sff)
    Executable/script: sff2fasta.pl
    Out: sequences (fasta)

Component 2 (XML)
    In: sequences (fasta) X In: database (fasta)
    Executable/script: BLAST
    Out: blast result (txt)

Tristan Glatard (2008) Future Generation Computer Systems
http://gwendia.i3s.unice.fr/doku.php?id=gwendia
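
The "X" between the two inputs of Component 2 is read here as a cross product iteration strategy: every sequence file is paired with every database. That reading is an assumption based on the diagram. The sketch below shows the idea in plain Python; the file and database names are hypothetical, and this is not GWENDIA syntax.

```python
from itertools import product


def cross_product_jobs(sequence_files, databases):
    """Pair every sequence file with every database (one BLAST job per pair)."""
    return [
        {"query": seq, "database": db}
        for seq, db in product(sequence_files, databases)
    ]


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    jobs = cross_product_jobs(
        ["sample1.fasta", "sample2.fasta"],
        ["human_ribosomal", "viruses"],
    )
    for job in jobs:
        print(job["query"], "vs", job["database"])
    # 2 files x 2 databases gives 4 independent jobs
```

Because the jobs are independent of each other, they can be distributed over grid nodes without any coordination between them.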
Run workflow on the grid




Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology In Biomedicine
Tristan Glatard (2008) International Journal of High Performance Computing Applications
Graphical user interface: VBrowser




                                     http://www.vl-e.nl/vbrowser
Workflow monitoring
Speed up

[Diagram: 15 experiments (722 samples), BLAST against 2 databases: Human ribosomal and Viruses]

Total CPU time: 413 hrs (~17 days)
Elapsed time workflow: 13.7 hrs
= 30x speed up

Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
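
The speed up figure follows from dividing total CPU time by elapsed workflow time. A short check of that arithmetic, using only the two numbers reported on the slide (the function name is illustrative):

```python
def speed_up(total_cpu_hours, elapsed_hours):
    """Ratio of summed CPU time to wall-clock time of the whole workflow."""
    return total_cpu_hours / elapsed_hours


cpu_hours = 413.0      # total CPU time reported on the slide
elapsed_hours = 13.7   # elapsed workflow time reported on the slide

print(f"CPU time in days: {cpu_hours / 24:.1f}")                    # 17.2
print(f"Speed up: {speed_up(cpu_hours, elapsed_hours):.1f}x")       # 30.1x
```

The ratio compares the grid run with a hypothetical serial run on a single core of the same speed, so it is an upper-level summary rather than a per-job measurement.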
Benefits workflow technology

         Agile development

       Re-use of components

         Iteration strategy

      Knowledge about analysis
     steps captured in workflow
Analysis of larger data sets
Genome of the Netherlands (GoNL)

Whole genome sequencing of 250 trios
Enrich biobanks
Reference set for disease studies

770 samples
45 TB raw data
Many partners (data sharing)
Analysis on distributed sites

http://www.bbmri.nl/
http://www.nlgenome.nl/
GoNL alignment pipeline

Pair1.fastq, Pair2.fastq, Reference genome
BWA aln, sampe, sam-to-bam, sort bam, index
Picard mark duplicates
GATK realignment
Picard fix mates
GATK recalibration
Result.bam

160 samples (478 lanes) are currently analyzed on the Dutch grid
Development and small tests: Nov 22, 2010 - now
Analysis: Mar 25, 2011 - Jul 15, 2011
Jobs: 13,981
Total CPU time: 5.5 years
Disk space used: 315 TB

Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)
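
The run statistics can be turned into rough per-job and per-sample averages. This is a back-of-the-envelope sketch that assumes a CPU year of 365 days; the averages are derived here and were not reported in the talk.

```python
HOURS_PER_YEAR = 365 * 24  # ASSUMPTION: one CPU year = 365 days

samples = 160
lanes = 478
jobs = 13_981
cpu_years = 5.5

total_cpu_hours = cpu_years * HOURS_PER_YEAR
cpu_hours_per_job = total_cpu_hours / jobs
lanes_per_sample = lanes / samples

print(f"Total CPU hours: {total_cpu_hours:.0f}")            # 48180
print(f"Average CPU hours per job: {cpu_hours_per_job:.1f}")  # 3.4
print(f"Average lanes per sample: {lanes_per_sample:.1f}")    # 3.0
```

An average of a few CPU hours per job suggests the work was split into many moderately sized tasks, which suits a grid where jobs are scheduled independently.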
Challenges

•   Error handling
•   Data management
•   Data protection
•   Provenance tracking
•   Transparent addition of other resources
Summary
More research and development needed in e-bioscience

Latest IT infrastructures needed for scaling up NGS data
   analysis (grids, clouds, big clusters)

Workflow technology assists agile implementation of
 bioinformatics software

Separate workflow development from IT infrastructure for
  easier migration and expansion (middleware)
Acknowledgements

Genome of the Netherlands, NL
Cisca Wijmenga
Morris Swertz
All project partners

Virus discovery unit, AMC
Lia van der Hoek
Michel de Vries

Department of genome analysis, AMC
Frank Baas
Ted Bradley
Marja Jakobs

University of Amsterdam
Piter de Boer

BiG Grid
Jan Just Keijser
Tom Visser
Grid support

Modalis, France
Johan Montagnat

Creatis, France
Tristan Glatard

Bioinformatics Laboratory, AMC
Antoine van Kampen

NGS bioinformatics team
Aldo Jongejan
Marcel Willemsen

e-Bioscience team
Silvia Olabarriaga
Angela Luyf
Mark Santcroos
Shayan Shahand

http://www.bioinformaticslaboratory.nl/
BWA on grid – component description

BWA on grid – component description

BWA on grid – workflow description
e-BioInfra gateway
http://orange.ebioscience.amc.nl/ebioinfragateway/

No grid certificate needed
Data upload via sFTP (intranet)
Synced with grid storage
Workflows are started from web page
Implemented workflow components for next generation sequencing

Existing software
• BLAST
• BLAT
• BWA
• Annovar
• Varscan
• Newbler
• FastQC
• Roche software
• GATK
• Picard
• Samtools

In-house software
• Data format converters
• Quality trimming
• Alternative splice product detection
• CDR3 detection (T- and B-cell variation)
• Genome comparison (small genomes)

IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Dernier (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Initial steps towards a production platform for DNA sequence analysis on the grid

  • 1. Initial steps towards a production platform for DNA sequence analysis on the grid ISMB/ECCB conference – 18 July 2011 Barbera van Schaik, Angela Luyf, Michel de Vries, Frank Baas, Antoine van Kampen and Silvia Olabarriaga b.d.vanschaik@amc.uva.nl
  • 2. Overview Grid computing and workflow technology Example: Virus discovery Analysis of larger data sets Example: Genome of the Netherlands Challenges and summary
  • 3. Sequencing, Moore’s law and personnel [Graph; axis label: Acceleration] Note: Only slope is meaningful in this graph. http://www.politigenomics.com/2009/02/the-scale-up.html
  • 4. What are the options? Local cluster, Desktop grid, Super computer, Hadoop cluster, GPU cluster, Cloud computing, (Inter)national Grid, DNA computing. Each system has its own interface; need to learn how they all work. National computing facilities
  • 5. Grids Distributed resources Computing Data storage Open protocols It's all about sharing Resources Methods Collaborations
  • 6. Dutch grid (resources) grid http://www.biggrid.nl/
  • 7. Sequence facility People, resources and data flow My role Bioinformatics NGS team e-BioScience team grid Research laboratories
  • 8. Example: Virus discovery [Figure: VIDISCA method, Virus discovery unit; experiments (exp1, exp2, exp3, exp6); GenBank - NR] Goal: Identify known and discover new viruses in samples. Michel de Vries et al (2011) PLoS ONE
  • 9. BLAST analysis workflow Input: sequence reads Conversion step (sff to fasta) BLAST Output: BLAST results
  • 10. Implementation of workflow components. Workflow description (XML). Component 1 (XML): In: sequences (sff); Executable/script: sff2fasta.pl; Out: sequences (fasta). Component 2 (XML): In: sequences (fasta) X In: database (fasta); Executable/script: BLAST; Out: blast result (txt). Tristan Glatard (2008) Future generation computer systems http://gwendia.i3s.unice.fr/doku.php?id=gwendia
  • 11. Run workflow on the grid Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology In Biomedicine Tristan Glatard (2008) International Journal of High Performance Computing Applications
  • 12. Graphical user interface: VBrowser http://www.vl-e.nl/vbrowser
  • 14. Speed up [Figure: experiments (exp1, exp2, exp3, exp6) run through Blast] 2 databases: Human ribosomal, Viruses. 15 experiments, 722 samples. Total CPU time: 413 hrs (~17 days). Elapsed time workflow: 13.7 hrs = 30x speed up. Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
  • 15. Benefits workflow technology Agile development Re-use of components Iteration strategy Knowledge about analysis steps captured in workflow
  • 16. Analysis of larger data sets. Genome of the Netherlands (GoNL): Whole genome sequencing of 250 trios; Enrich biobanks; Reference set for disease studies. 770 samples; 45 TB raw data; Many partners (data sharing); Analysis on distributed sites. http://www.bbmri.nl/ http://www.nlgenome.nl/
  • 17. GoNL alignment pipeline. Input: Pair1.fastq, Pair2.fastq, Reference genome. Steps: BWA aln, sampe, sam-to-bam, sort bam, index; Picard mark duplicates; GATK realignment; Picard fix mates; GATK recalibration. Output: Result.bam. 160 samples (478 lanes) are currently analyzed on the Dutch grid. Development and small tests: Nov 22, 2010 - now. Analysis: Mar 25, 2011 - Jul 15, 2011. Jobs: 13,981. Total CPU time: 5.5 years. Disk space used: 315 TB. Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)
  • 18. Challenges: Error handling; Data management; Data protection; Provenance tracking; Transparent addition of other resources
  • 19. Summary: More research and development needed in e-bioscience. Latest IT infrastructures needed for scaling up NGS data analysis (grids, clouds, big clusters). Workflow technology assists agile implementation of bioinformatics software. Separate workflow development from IT infrastructure for easier migration and expansion (middleware)
  • 20. Acknowledgements Genome of the University of Amsterdam Bioinformatics Laboratory, AMC Netherlands, NL Piter de Boer Antoine van Kampen Cisca Wijmenga Morris Swertz BiG Grid NGS bioinformatics team All project partners Jan Just Keijser Aldo Jongejan Tom Visser Marcel Willemsen Virus discovery unit, AMC Grid support Lia van der Hoek e-Bioscience team Michel de Vries Modalis, France Silvia Olabarriaga Johan Montagnat Angela Luyf Department of Mark Santcroos genome analysis, AMC Creatis, France Shayan Shahand Frank Baas Tristan Glatard Ted Bradley Marja Jakobs http://www.bioinformaticslaboratory.nl/
  • 22. BWA on grid – component description
  • 23. BWA on grid – component description
  • 24. BWA on grid – workflow description
  • 25. e-BioInfra gateway (http://orange.ebioscience.amc.nl/ebioinfragateway/): No grid certificate needed. Data upload via sFTP (intranet). Synced with grid storage. Workflows are started from web page
  • 26. Implemented workflow components for next generation sequencing. Existing software: BLAST, BLAT, BWA, Annovar, Varscan, Newbler, FastQC, Roche software, GATK, Picard, Samtools. In-house software: Data format converters, Quality trimming, Alternative splice product detection, CDR3 detection (T- and B-cell variation), Genome comparison (small genomes)
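
The speed-up reported on slide 14 can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of speed-up, total CPU time divided by elapsed wall-clock time; the function name `speedup` is illustrative and does not come from the slides. Only the two figures (413 hrs and 13.7 hrs) are taken from the presentation.

```python
def speedup(total_cpu_hours: float, elapsed_hours: float) -> float:
    """Return the ratio of summed CPU time to wall-clock time.

    Assumption: speed-up is defined as total CPU time / elapsed time,
    i.e. how much longer the jobs would take if run one after another.
    """
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return total_cpu_hours / elapsed_hours


# Figures quoted on the "Speed up" slide.
cpu_hours = 413.0      # total CPU time over all grid jobs
elapsed_hours = 13.7   # elapsed time of the whole workflow

factor = speedup(cpu_hours, elapsed_hours)
cpu_days = cpu_hours / 24.0  # the same CPU time expressed in days

print(f"CPU time: {cpu_hours:.0f} h (~{cpu_days:.1f} days)")
print(f"Speed-up: {factor:.1f}x")
```

Rounded to whole numbers, this gives the "~17 days" and "30x" stated on the slide.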