Assembling genomes using ABySS
dnGASP 2011

Shaun Jackman
BC Genome Sciences Centre
sjackman@bcgsc.ca
abyss-users@bcgsc.ca

An assembly in two stages
●   Stage 1: Sequence assembly algorithm
●   Stage 2: Paired-end assembly algorithm

Stage 1
Sequence assembly algorithm
●   Load k-mers: load the reads,
    breaking each read into k-mers
●   Find overlaps: find adjacent k-mers,
    which overlap by k-1 bases
●   Prune tips: remove k-mers resulting
    from read errors
●   Pop bubbles: remove variant sequences
●   Generate contigs

Load the reads
●   For each input read of length l, (l - k + 1) k-mers
    are generated by sliding a window of length k
    over the read
●   Each k-mer is a vertex of the de Bruijn graph
●   Two adjacent k-mers are an edge of the de Bruijn graph

      Read (l = 12):
         ATCATACATGAT
      k-mers (k = 9):
         ATCATACAT
          TCATACATG
           CATACATGA
            ATACATGAT

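A minimal sketch of the sliding window described on this slide. The function name `kmers` is illustrative and not part of ABySS, and the sketch ignores reverse complements, which a real assembler has to handle.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return the (l - k + 1) k-mers of a read of length l, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The example read from the slide: l = 12 and k = 9 give 12 - 9 + 1 = 4 k-mers.
read = "ATCATACATGAT"
for offset, kmer in enumerate(kmers(read, 9)):
    print(" " * offset + kmer)
```

A read shorter than k yields an empty list, so such reads contribute nothing to the graph.
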
De Bruijn Graph
●   A simple graph for k = 5
●   Two reads
        –   GGACATC
        –   GGACAGA

[Figure: de Bruijn graph of the two reads]
        GGACA → GACAT → ACATC
        GGACA → GACAG → ACAGA

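The two-read example can be reproduced with a short sketch. It assumes the simplest possible representation, a dictionary from each k-mer to the k-mers that overlap it by k-1 bases; the name `de_bruijn_graph` is hypothetical, and reverse complements are again left out.

```python
def de_bruijn_graph(reads, k):
    """Map each k-mer to the sorted list of k-mers that can follow it.

    Two k-mers are adjacent when the last k-1 bases of the first
    equal the first k-1 bases of the second.
    """
    vertices = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = {}
    for kmer in vertices:
        # Try the four possible one-base extensions of this k-mer.
        candidates = (kmer[1:] + base for base in "ACGT")
        graph[kmer] = sorted(c for c in candidates if c in vertices)
    return graph


graph = de_bruijn_graph(["GGACATC", "GGACAGA"], 5)
for kmer in sorted(graph):
    print(kmer, "->", graph[kmer])
```

GGACA is the only vertex with two successors, which is the fork where the two reads diverge.
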
Pruning tips
●   Read errors cause tips
●   Pruning tips removes the k-mers that
    result from read errors from the assembly

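One way to picture tip pruning is the sketch below. It is not ABySS's implementation: it assumes a graph stored as a dictionary of successor sets, only looks at dead ends on the outgoing side, and uses a hypothetical length threshold `max_tip_len`.

```python
def prune_tips(graph, max_tip_len):
    """Remove dead-end branches of at most max_tip_len k-mers.

    graph maps each k-mer to the set of k-mers that follow it.
    Returns a pruned copy; the input is left unchanged.
    """
    graph = {v: set(succ) for v, succ in graph.items()}
    preds = {v: set() for v in graph}
    for v, succ in graph.items():
        for w in succ:
            preds[w].add(v)

    dead_ends = [v for v in graph if not graph[v]]
    for end in dead_ends:
        tip = [end]
        # Walk backward while the branch is a simple chain and still short.
        while len(tip) <= max_tip_len and len(preds[tip[-1]]) == 1:
            parent = next(iter(preds[tip[-1]]))
            if len(graph[parent]) > 1:
                # parent is a fork, so the short chain hanging off it is a tip.
                graph[parent].discard(tip[-1])
                for v in tip:
                    del graph[v]
                break
            tip.append(parent)
    return graph


# True sequence GGACATCGT plus one erroneous k-mer GACAG (k = 5).
graph = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"},
    "ACATC": {"CATCG"},
    "CATCG": {"ATCGT"},
    "ATCGT": set(),
    "GACAG": set(),
}
pruned = prune_tips(graph, max_tip_len=2)
print(sorted(pruned))
```

The threshold matters: if it is set longer than the genuine branch, the sketch would happily prune real sequence too, so a practical assembler needs more evidence than length alone.
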
Popping bubbles
●   Variant sequences cause
    bubbles
●   Popping bubbles removes
    the variant sequence from
    the assembly
●   Repeat sequences with
    small differences also
    cause bubbles




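A rough sketch of bubble popping follows. The slides do not say which branch of a bubble is kept, so choosing the branch with the lower mean k-mer coverage as the one to remove is an assumption, as are the names `pop_bubbles`, `coverage` and `max_len`.

```python
def pop_bubbles(graph, coverage, max_len):
    """Remove the lower-coverage branch of each simple bubble.

    graph maps each k-mer to the set of k-mers that follow it;
    coverage maps each k-mer to how often it was seen (hypothetical input).
    """
    graph = {v: set(succ) for v, succ in graph.items()}
    indeg = {v: 0 for v in graph}
    for succ in graph.values():
        for w in succ:
            indeg[w] += 1

    def walk(v):
        """Follow a simple chain from v; return (chain, vertex where it stops)."""
        chain = []
        while indeg[v] == 1 and len(graph[v]) == 1 and len(chain) < max_len:
            chain.append(v)
            v = next(iter(graph[v]))
        return chain, v

    def mean_coverage(chain):
        return sum(coverage[v] for v in chain) / len(chain)

    for fork in list(graph):
        if fork not in graph or len(graph[fork]) != 2:
            continue
        (chain_a, end_a), (chain_b, end_b) = (walk(v) for v in graph[fork])
        # A bubble: both branches are non-empty chains that rejoin at one vertex.
        if chain_a and chain_b and end_a == end_b:
            weaker = min(chain_a, chain_b, key=mean_coverage)
            graph[fork].discard(weaker[0])
            indeg[end_a] -= 1
            for v in weaker:
                del graph[v]
    return graph


# Toy bubble with placeholder vertex names: S forks into A1-A2 and B1-B2, rejoining at E.
graph = {"S": {"A1", "B1"}, "A1": {"A2"}, "A2": {"E"},
         "B1": {"B2"}, "B2": {"E"}, "E": set()}
coverage = {"S": 30, "A1": 30, "A2": 28, "B1": 2, "B2": 3, "E": 30}
print(sorted(pop_bubbles(graph, coverage, max_len=5)))
```

Only the simplest bubbles are handled here: two branches, each a plain chain. Bubbles caused by near-identical repeats look the same to this code, which is why the slide lists them alongside variants.
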
Assemble contigs
●   Remove ambiguous
    edges
●   Output contigs in
    FASTA format




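The two bullets above can be sketched as follows, assuming the same dictionary-of-successors graph as before. An edge is treated as ambiguous when its source has several successors or its target has several predecessors. The numeric FASTA identifiers are placeholders, not ABySS's header format, and isolated cycles are ignored.

```python
def assemble_contigs(graph):
    """Drop ambiguous edges, then merge each remaining chain of k-mers."""
    indeg = {v: 0 for v in graph}
    for succ in graph.values():
        for w in succ:
            indeg[w] += 1

    # Keep edge u -> v only when it is unambiguous on both sides.
    unique_next = {}
    for u, succ in graph.items():
        if len(succ) == 1:
            (v,) = succ
            if indeg[v] == 1:
                unique_next[u] = v

    has_unique_pred = set(unique_next.values())
    contigs = []
    for start in graph:
        if start in has_unique_pred:
            continue  # not the first k-mer of its chain
        seq, v = start, start
        while v in unique_next:
            v = unique_next[v]
            seq += v[-1]  # adjacent k-mers overlap by k-1 bases
        contigs.append(seq)
    return sorted(contigs)


def to_fasta(contigs):
    """Format contigs as FASTA records with placeholder numeric names."""
    return "".join(f">{i}\n{seq}\n" for i, seq in enumerate(contigs))


graph = {"GGACA": {"GACAT", "GACAG"}, "GACAT": {"ACATC"},
         "GACAG": {"ACAGA"}, "ACATC": set(), "ACAGA": set()}
print(to_fasta(assemble_contigs(graph)), end="")
```

On the unpruned two-read graph the fork at GGACA splits the assembly into three contigs, which illustrates why tips and bubbles are removed first.
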
Stage 2
Paired-end assembly algorithm
●   Align the reads to the contigs of the first stage
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig
●   Estimate the distance between contigs using
    the paired reads that align to different contigs




Align the reads to the contigs
                      KAligner
●   Every k-mer in the single-end
    assembly is unique
●   KAligner can map reads with k
    consecutive correct bases
●   ABySS may use other aligners,
    including BWA and bowtie




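Because every k-mer in the single-end assembly is unique, a plain dictionary from k-mer to position is enough to place a read. The sketch below shows the idea only; it is not KAligner's code, the function names are made up, and it searches the forward strand only.

```python
def index_contigs(contigs, k):
    """Map every k-mer of every contig to its (contig name, position)."""
    index = {}
    for name, seq in contigs.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]] = (name, pos)
    return index


def align_read(read, index, k):
    """Return (contig name, implied read start) for the first exact k-mer hit."""
    for offset in range(len(read) - k + 1):
        hit = index.get(read[offset:offset + k])
        if hit is not None:
            name, pos = hit
            return name, pos - offset
    return None  # no stretch of k consecutive correct bases


contigs = {"c1": "GGACATCGT", "c2": "TTTGCAAAC"}
index = index_contigs(contigs, k=5)
print(align_read("TCATCGT", index, k=5))  # first base is a read error
```

A read needs only one window of k consecutive correct bases to be placed, which is the property stated on the slide.
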
Empirical fragment-size distribution
                     ParseAligns
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig




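A small sketch of how such a distribution could be tallied. It assumes each alignment is given as (contig name, leftmost position), that both reads of a pair have the same length, and that the fragment runs from the leftmost start to the rightmost end. None of this reflects ParseAligns's actual input format.

```python
from collections import Counter


def fragment_size_histogram(pairs, read_len):
    """Count fragment sizes over pairs whose two reads hit the same contig."""
    hist = Counter()
    for (contig_a, start_a), (contig_b, start_b) in pairs:
        if contig_a == contig_b:
            hist[abs(start_b - start_a) + read_len] += 1
    return hist


def to_pmf(hist):
    """Normalize a histogram into an empirical probability mass function."""
    total = sum(hist.values())
    return {size: count / total for size, count in hist.items()}


pairs = [
    (("c1", 100), ("c1", 550)),    # same contig, fragment of 500
    (("c1", 2000), ("c1", 1540)),  # same contig, fragment of 510
    (("c1", 10), ("c2", 40)),      # different contigs, skipped here
]
hist = fragment_size_histogram(pairs, read_len=50)
print(dict(hist), to_pmf(hist))
```

Pairs that span two contigs are skipped at this step; they are the input to the distance estimation that follows.
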
Estimate distances between contigs
                     DistanceEst
●   Estimate the distance between contigs using
    the paired reads that align to different contigs

[Figure: distance estimates between contigs: d = 25 ± 8, d = 3 ± 5, d = 6 ± 5, d = 4 ± 3]

Maximum likelihood estimator
                    DistanceEst
●   Use the empirical paired-
    end size distribution
●   Maximize the likelihood
    function
●   Find the most likely
    distance between the two
    contigs



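The three bullets can be illustrated with a deliberately simplified estimator. It assumes that for each pair spanning the two contigs we know `span`, the part of the fragment that lies inside the contigs, so the fragment size is `span + d` for a gap of d. It ignores the truncation effects a real estimator has to model, and the names `estimate_distance` and `floor` are illustrative.

```python
import math


def estimate_distance(spans, pmf, candidates, floor=1e-9):
    """Return the candidate gap d that maximizes the log-likelihood.

    spans: per pair, fragment length accounted for inside the two contigs.
    pmf:   empirical fragment-size distribution, size -> probability.
    floor: small probability used for sizes never observed (an assumption).
    """
    def log_likelihood(d):
        return sum(math.log(max(pmf.get(s + d, 0.0), floor)) for s in spans)

    return max(candidates, key=log_likelihood)


# Toy distribution centered on 500 bp and four spanning pairs.
pmf = {498: 0.1, 499: 0.2, 500: 0.4, 501: 0.2, 502: 0.1}
spans = [475, 476, 474, 475]
print(estimate_distance(spans, pmf, range(-50, 200)))
```

With these toy numbers a gap of 25 puts the four implied fragment sizes at 500, 501, 499 and 500, the most probable combination under the distribution.
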
Paired-end algorithm
                   continued...
●   Generate paths: find paths through the contig
    adjacency graph that agree with the distance estimates
●   Merge paths: merge overlapping paths
●   Generate contigs: merge the contigs in these paths
    and output the FASTA file

Find consistent paths
                    SimpleGraph
●   Find paths through the contig adjacency graph
    that agree with the distance estimates

[Figure: estimated distance d = 4 ± 3; actual distance = 3]

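A sketch of the path search, under simplifying assumptions: the distance implied by a path is taken to be the summed length of the contigs between the two endpoints, and the k-1 overlap between adjacent contigs is ignored. The toy graph, its contig names and the function `consistent_paths` are hypothetical; only the numbers d = 4 ± 3 and the actual distance of 3 come from the slide.

```python
def consistent_paths(adj, lengths, start, goal, d, err):
    """List paths start -> goal whose implied gap lies within d ± err."""
    found = []

    def extend(path, gap):
        for nxt in adj.get(path[-1], ()):
            if nxt == goal:
                if abs(gap - d) <= err:
                    found.append(path + [nxt])
            elif nxt not in path and gap + lengths[nxt] <= d + err:
                # Still short enough to possibly agree with the estimate.
                extend(path + [nxt], gap + lengths[nxt])

    extend([start], 0)
    return found


# Two routes from A to B: through X (3 bp) or through Y (12 bp).
adj = {"A": ["X", "Y"], "X": ["B"], "Y": ["B"]}
lengths = {"A": 100, "B": 100, "X": 3, "Y": 12}
print(consistent_paths(adj, lengths, "A", "B", d=4, err=3))
```

The route through X implies a distance of 3, inside 4 ± 3, while the route through Y is abandoned as soon as it exceeds the upper bound.
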
Merge overlapping paths
                    MergePaths
●   Merge paths that overlap




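As a minimal illustration of what merging overlapping paths means, the sketch below joins two paths of contig names when the end of one matches the start of the other. It is an assumption about the simplest case, not a description of how MergePaths resolves conflicts, and `merge_two` is a made-up name.

```python
def merge_two(p, q):
    """Append q to p if a suffix of p equals a prefix of q, else return None."""
    for n in range(min(len(p), len(q)), 0, -1):
        if p[-n:] == q[:n]:
            return p + q[n:]
    return None


print(merge_two(["A", "B", "C"], ["B", "C", "D"]))
print(merge_two(["A", "B"], ["C", "D"]))
```

The longest overlap is tried first, so two paths that share several contigs are joined without duplicating any of them.
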
Generate the FASTA output
●   Merge the contigs in these paths
●   Output the FASTA file

[Figure: contigs merged along a path: GATTTTTG GAC GTCTTGATCTT CAC GTATTG CTATT]

Assembly process
●   Stage 1 completed in 3.5 hours
        –   Used 72 processors on six machines
        –   Peak memory usage of 180 GB of RAM
●   Stage 2 completed in 9 hours
        –   Used 12 processors on one machine
        –   Peak memory usage of 48 GB of RAM
●   Assembly parameters: k=64 s=200 n=10

Assembly results
          Level 1: 500-bp paired-end reads
●   Assembled half the genome in 7,676 contigs
    larger than the N50 of 50,612 bp
●   Assembled 1.81 Gbp in 170,407 contigs larger
    than 200 bp
●   The largest contig is 1,158,576 bp
●   Removed 1,296,819 variant sequences




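For readers unfamiliar with N50, the sketch below shows how the statistic and the accompanying contig count relate. It uses the total assembled length as the denominator, which is an assumption: the slide does not say whether the assembly size or the genome size was used.

```python
def n50(lengths):
    """Return (N50, number of contigs needed to reach half the total length)."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise ValueError("no contigs given")


# Toy contig lengths totalling 300 bp.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```

Read this way, the first bullet says that the 7,676 largest contigs, each at least 50,612 bp long, together cover half of the total.
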
Alignments to the reference
●   Aligned the 170,407 contigs longer than 200 bp
●   96.2% align over at least 99% of their length
●   1.2% align over between 90% and 99% of their length
●   2.5% align over less than 90% of their length

[Figure: share of contigs by aligned length: >99%, 90-99%, <90%]

Works in progress
●   Replace complex variant sequences with Ns
●   Scaffold over gaps and simple repeat sequences
    using large-fragment mate-pair reads
●   Fill in gaps with sequence using localized
    microassembly

ABySS Publications
         IEEE InfoVis 2009

Acknowledgments
    Supervisors
●   İnanç Birol
●   Steven Jones
    Team
●   Readman Chiu
●   Rod Docking
●   Karen Mungall
●   Jenny Qian
