Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core
Luc Dehaspe, Genomics Core, UZ Leuven
WOUD – Onderzoeksgroep Associatie Universiteit Gent, 28 Sept 2011

DNA sequencing determines the order of nucleotide bases in a genome
  • DNA replication machinery: Human Genome (2 x 3 billion bases) -> Human Genome (2 x 3 billion bases), hours
  • Final Generation Sequencing machine: Human Genome (2 x 3 billion bases) -> Human Genome (2 x 800 Mb text)
  • Computer's copy function: Human Genome (2 x 800 Mb text) -> Human Genome (2 x 800 Mb text), minutes

Next generation sequencing
  • Quality deteriorates after 100-1000 base pairs
  • Solution: cut genomes into readable fragments
  • Sequence fragments -> reads
  • Use bioinformatics to reconstruct genomes from reads
  • Human Genome (2 x 3 billion bases) -> Next Generation Sequencing machine -> reads in text format -> bioinformatics -> Human Genome (2 x 800 Mb text)

Sequencers vs Bioinformatics
  • Human Genome: 2 x 3 billion bases -> sequencer -> bioinformatics -> Human Genome: 2 x 800 Mb text
  • HiSeq 2000 v3: 55 billion bases per day (6 Human Genomes in 10 days)
  • HiSeq 2000 v2: 18 billion bases per day
  • Roche GS FLX: 1 billion bases per day
  • Scale up bioinformatics or pile up sequencer output

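As a rough check on the HiSeq 2000 v3 figures, the sketch below works out how many sequenced bases each of the 6 genomes receives in a 10-day run. Reading the result as a coverage depth against a single 3-billion-base copy of the genome is an assumption made here, not something the slide states.

```python
# Figures taken from the slide.
BASES_PER_DAY = 55e9      # HiSeq 2000 v3
RUN_DAYS = 10
GENOMES_PER_RUN = 6

# Assumption: coverage is measured against one 3-billion-base copy of the genome.
REFERENCE_BASES = 3e9

bases_per_run = BASES_PER_DAY * RUN_DAYS
bases_per_genome = bases_per_run / GENOMES_PER_RUN
coverage = bases_per_genome / REFERENCE_BASES

print(f"bases per run:    {bases_per_run:.3g}")
print(f"bases per genome: {bases_per_genome:.3g}")
print(f"implied coverage: {coverage:.1f}x")
```

Under that assumption each genome is read roughly thirty times over, which is why the raw output is so much larger than the genome itself.
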
Bioinformatics pipeline
  • Case: Human Exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run
  • Demultiplex: sort indexed reads per sample
  • Alignment: align reads per sample to reference genome

Bioinformatics pipeline
  • Case: Human Exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run
  • Demultiplex: sort indexed reads per sample
  • Alignment: align reads per sample to reference genome
  • Variant Calling: compare pileup of reads at a given locus to reference, identify SNPs, insertions and deletions

A bioinformatics pipeline
  • Case: Human Exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run
  • Demultiplex: sort indexed reads per sample
  • Alignment: align reads per sample to reference genome
  • Variant Calling: compare to reference, identify SNPs, insertions and deletions
  • Annotation: annotate variants (gene, effect on protein sequence, conservation, frequency, predicted effect on protein function, …)
  • Sequencing: 10 days
  • Above pipeline: > 60 days on 1 CPU
  • Scale up or pile up

Favourable race conditions
  • Same task performed on many reads or loci
  • FOR 1.1 billion indexed reads DO: identify sample
  • FOR 3 billion Human Genome loci DO: compare locus in aligned reads to reference and identify homo- and heterozygous SNPs
  • Results for one read/locus are independent of results for other reads/loci
  • Suggests a natural scale-up strategy …

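The two FOR loops can be sketched in Python as per-item functions. This is a minimal illustration, not the Genomics Core's code: the index table, the read layout `(index, sequence)` and the 20% allele-fraction threshold are all hypothetical. The point is that each function looks at a single read or a single locus and nothing else.

```python
from collections import Counter

# Hypothetical index sequences and sample names, for illustration only.
INDEX_TO_SAMPLE = {"ACGT": "Sample1", "TGCA": "Sample2"}


def identify_sample(read):
    """Demultiplex one indexed read; depends only on that read."""
    index, _sequence = read
    return INDEX_TO_SAMPLE.get(index, "undetermined")


def call_snp(reference_base, pileup, min_fraction=0.2):
    """Compare the bases piled up at one locus to the reference base.

    Returns None when no alternative allele reaches min_fraction,
    otherwise ("het" or "hom", alternative_base). The threshold is an
    arbitrary illustrative value, not a recommended setting.
    """
    counts = Counter(pileup)
    alt_counts = {b: n for b, n in counts.items() if b != reference_base}
    if not alt_counts:
        return None
    alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
    if alt_n / len(pileup) < min_fraction:
        return None
    ref_fraction = counts[reference_base] / len(pileup)
    # Reference allele still present: heterozygous; otherwise homozygous.
    return ("het" if ref_fraction >= min_fraction else "hom", alt_base)


reads = [("ACGT", "TTGA"), ("TGCA", "GGCA"), ("NNNN", "AAAA")]
samples = [identify_sample(r) for r in reads]           # FOR each read DO
calls = [call_snp("A", p) for p in ("AAAA", "AAGG", "GGGG")]  # FOR each locus DO
```

Because neither function keeps state between calls, the list comprehensions at the end could be split over any number of workers without changing the results.
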
Data parallelism
  • Reads or loci are partitioned among the nodes of a computer cluster
  • Each node demultiplexes, aligns, etc. on its local partition
  • Speed-up is (near) linear in the number of cluster nodes
  • Example: variant calling on 3 billion Human Genome loci is split into variant calling on Chr1 … variant calling on ChrY, on a cluster of 24 computers (nodes)

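A minimal sketch of the per-chromosome split, assuming one partition per chromosome (22 autosomes plus X and Y, matching the 24 nodes). Threads stand in for cluster nodes here, and `call_variants_on` is a hypothetical placeholder that only counts mismatches; a real setup would dispatch an actual variant caller to separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# One partition per chromosome: chr1..chr22, chrX, chrY.
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def call_variants_on(chromosome, loci):
    """Placeholder for per-chromosome variant calling.

    Counts loci whose observed base differs from the reference base.
    """
    return chromosome, sum(1 for ref, obs in loci if ref != obs)


def run_data_parallel(loci_by_chromosome, n_nodes=24):
    """Process every partition independently, then combine the results."""
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        futures = [
            pool.submit(call_variants_on, chrom, loci)
            for chrom, loci in loci_by_chromosome.items()
        ]
        return dict(f.result() for f in futures)


# Toy input: (reference base, observed base) pairs per chromosome.
toy = {chrom: [("A", "A"), ("C", "T")] for chrom in CHROMOSOMES}
toy["chrY"] = [("G", "G")]
variants_per_chromosome = run_data_parallel(toy)
```

With an ideal linear speed-up, a job that takes 60 days on one CPU would take about 2.5 days on 24 nodes. Chromosomes differ in size, so a per-chromosome split will usually fall somewhat short of that ideal.
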
Data parallelism: demultiplex HiSeq 2000 microplate
  • 1 node: 1.1 billion reads at 1600 reads per second
  • 1 microplate processed as a whole: 8 days
  • Split into 8 lanes processed in parallel: 1 day
  • Split into 384 tiles processed in parallel: ½ hour

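The timings follow directly from the read count and the per-node rate. The sketch below reproduces them, assuming the reads are spread evenly over the partitions and that there is no coordination overhead.

```python
TOTAL_READS = 1.1e9
READS_PER_SECOND = 1600   # throughput of a single node, from the slide


def wall_clock_seconds(n_partitions):
    """Ideal elapsed time when n equal partitions are processed in parallel."""
    return TOTAL_READS / READS_PER_SECOND / n_partitions


for label, n in [("1 microplate", 1), ("8 lanes", 8), ("384 tiles", 384)]:
    seconds = wall_clock_seconds(n)
    print(f"{label:>12}: {seconds / 86400:6.2f} days = {seconds / 3600:7.2f} hours")
```

The three cases come out at just under 8 days, just under 1 day and about half an hour, in line with the slide.
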
Favourable race conditions
  • MapReduce: data parallelism made easy
  • Developed and extensively used at Google
  • Open source library (C++) takes care of parallelization, fault tolerance, data distribution and load balancing
  • No knowledge of parallel systems required
  • User implements functions Map() and Reduce()

MapReduce: demultiplex reads
  • 8 lanes -> 8 Map tasks
  • Map: sort reads by sample (Sample1, Sample2, Sample3)
  • Wait until map has finished
  • Reduce: deduplicate reads, one task per sample (Sample1 reads, Sample2 reads, Sample3 reads)
  • Output: Sample1.fastq.gz, Sample2.fastq.gz, Sample3.fastq.gz

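The map, wait and reduce steps can be imitated in plain Python. This is a sketch of the pattern only and does not use any MapReduce library. The index table is hypothetical, "deduplicate" is simplified to dropping exact duplicate sequences, and the results are returned in memory instead of being written to `.fastq.gz` files.

```python
from collections import defaultdict

# Hypothetical index-to-sample table, for illustration only.
INDEX_TO_SAMPLE = {"ACGT": "Sample1", "TGCA": "Sample2", "GATC": "Sample3"}


def map_lane(lane_reads):
    """Map task for one lane: emit (sample, sequence) pairs."""
    for index, sequence in lane_reads:
        sample = INDEX_TO_SAMPLE.get(index)
        if sample is not None:          # reads with unknown index are dropped
            yield sample, sequence


def shuffle(mapped_pairs):
    """Group map output by sample; needs the output of all map tasks."""
    grouped = defaultdict(list)
    for sample, sequence in mapped_pairs:
        grouped[sample].append(sequence)
    return grouped


def reduce_sample(sample, sequences):
    """Reduce task for one sample: drop exact duplicates, keep first-seen order."""
    return sample, list(dict.fromkeys(sequences))


def demultiplex(lanes):
    mapped = [pair for lane in lanes for pair in map_lane(lane)]
    return dict(reduce_sample(s, seqs) for s, seqs in shuffle(mapped).items())


lanes = [
    [("ACGT", "TTGACC"), ("TGCA", "GGCATT")],
    [("ACGT", "TTGACC"), ("GATC", "CCATGA"), ("NNNN", "AAAAAA")],
]
per_sample = demultiplex(lanes)
```

Each `map_lane` call and each `reduce_sample` call is independent of the others, so a framework is free to run them on different nodes. Only the grouping step in between has to wait for every map task.
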
Favourable race conditions
  • GATK: MapReduce for sequencing projects
  • Genome Analysis Toolkit
  • Developed by and used extensively at the Broad Institute (Harvard and MIT)
  • Open source, Java 1.6 framework
  • Provides common data access patterns: traversal by read, traversal by locus

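The two access patterns can be illustrated with a toy example. The code below is not GATK's API (GATK is a Java framework). It is a hypothetical Python sketch in which an aligned read is just `(start, sequence)` with 0-based positions and no gaps, and the caller supplies a `visit` function.

```python
from collections import defaultdict

# Toy aligned reads: (0-based start position on the reference, read sequence).
READS = [(0, "ACG"), (1, "CGT"), (2, "GAT")]


def traverse_by_read(reads, visit):
    """Call visit(start, sequence) once per aligned read."""
    return [visit(start, seq) for start, seq in reads]


def traverse_by_locus(reads, visit):
    """Call visit(position, bases) once per covered reference position.

    bases is the pileup: every read base aligned to that position.
    """
    pileups = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileups[start + offset].append(base)
    return [visit(pos, pileups[pos]) for pos in sorted(pileups)]


read_lengths = traverse_by_read(READS, lambda start, seq: len(seq))
depth = traverse_by_locus(READS, lambda pos, bases: (pos, len(bases)))
```

In both cases the user writes only the `visit` function. The traversal decides how the data is walked, and could therefore also decide how it is split across nodes.
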
Favourable race conditions
  • Data parallelism is supported by many (open source) bioinformatics tools
  • Number of nodes is a parameter
  • Full analysis pipelines are widely available: GATK, CASAVA, …

Conclusion
  • Data parallelism is key
  • Scale up by buying extra cluster nodes
  • Genomics Core recently added 400 nodes (shared)
  • Canned solutions for common bioinformatics tasks
  • Established programming frameworks for custom solutions: MapReduce, GATK

Conclusion
  • Bioinformaticians enjoy favourable conditions for keeping pace with sequencer …
  • Final Generation Sequencing machine: Human Genome (2 x 3 billion bases) -> Next Generation Sequencing machine -> reads in text format -> bioinformatics using data parallelism -> Human Genome (2 x 800 Mb text)
