The best of both worlds
Combining PacBio with short read technology
  for improved de novo genome assembly

          Lex Nederbragt, NSC and CEES
           lex.nederbragt@bio.uio.no
This talk
Why does everybody want longer reads?


        … for genome assemblies
What is a genome assembly


    Hierarchical structure

reads

 contigs

   scaffolds
Sequence data

                           Reads

original DNA

 fragments

                  Sequenced ends




               http://www.cbcb.umd.edu/research/assembly_primer.shtml
Contigs

                          Building contigs

Aligned reads
  ACGCGATTCAGGTTACCACG
    GCGATTCAGGTTACCACGCG
      GATTCAGGTTACCACGCGTA
        TTCAGGTTACCACGCGTAGC
          CAGGTTACCACGCGTAGCGC
            GGTTACCACGCGTAGCGCAT
              TTACCACGCGTAGCGCATTA
                ACCACGCGTAGCGCATTACA
                  CACGCGTAGCGCATTACACA
                    CGCGTAGCGCATTACACAGA
                      CGTAGCGCATTACACAGATT
                        TAGCGCATTACACAGATTAG
Consensus contig
  ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
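
The way the aligned reads above add up to the consensus contig can be sketched in a few lines of Python. This is only a toy illustration: it assumes error-free reads that are already in left-to-right order, and the function names `overlap` and `merge_in_order` are made up for this example. Real assemblers have to deal with sequencing errors, unknown read order and both strands.

```python
# The twelve reads shown on the slide, in left-to-right order.
READS = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]


def overlap(left, right, min_len=1):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_in_order(reads, min_overlap=10):
    """Extend a contig read by read, using exact suffix/prefix overlaps."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_overlap)
        if k == 0:
            raise ValueError("read does not overlap the contig: " + read)
        contig += read[k:]  # append only the part that is new
    return contig


contig = merge_in_order(READS)
print(contig)
```

Each read overlaps the growing contig by 18 bases and contributes 2 new ones, which is why twelve 20 bp reads give a 42 bp contig.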
Contigs

                          Building contigs




     Repeat copy 1                                    Repeat copy 2




                                                          Contig orientation?
                                                            Contig order?




Collapsed repeat
   consensus
                     http://www.cbcb.umd.edu/research/assembly_primer.shtml
Mate pairs

                          Other read type




     Repeat copy 1                      Repeat copy 2




(much) longer fragments
                                            mate pair reads
Scaffolds

                 Ordered, oriented contigs




    mate pairs
contigs



                           gap size estimate
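
How mate pairs give a gap size estimate can be sketched as simple arithmetic. This is a minimal sketch, not the method of any particular scaffolder: it assumes the library's fragment (insert) size is known, and that for every mate pair spanning the gap we know how many bases of the fragment lie in each of the two contigs. The function name `estimate_gap` and the numbers are hypothetical.

```python
from statistics import mean


def estimate_gap(insert_size, spanning_pairs):
    """Estimate the gap between two contigs from mate pairs that span it.

    insert_size    -- expected fragment length of the mate pair library
    spanning_pairs -- list of (bases_in_left_contig, bases_in_right_contig),
                      the part of each fragment that falls inside a contig
    """
    # fragment = part in left contig + gap + part in right contig
    return mean(insert_size - left - right for left, right in spanning_pairs)


# Hypothetical 3 kb mate pair library, three pairs spanning the same gap.
pairs = [(1200, 1500), (1000, 1700), (1400, 1300)]
print(estimate_gap(3000, pairs))
```

In real data the fragment sizes vary around the library mean, so the estimate gets better with more spanning pairs, and it can come out negative when contigs actually overlap.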
What is a genome assembly


    Hierarchical structure

reads
    Aligned reads
      ACGCGATTCAGGTTACCACG
        GCGATTCAGGTTACCACGCG
          GATTCAGGTTACCACGCGTA
            TTCAGGTTACCACGCGTAGC
              CAGGTTACCACGCGTAGCGC
                GGTTACCACGCGTAGCGCAT
                  TTACCACGCGTAGCGCATTA
                    ACCACGCGTAGCGCATTACA
                      CACGCGTAGCGCATTACACA
                        CGCGTAGCGCATTACACAGA
                          CGTAGCGCATTACACAGATT
                            TAGCGCATTACACAGATTAG

 contigs
    Consensus contig
      ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG

   scaffolds
Genome assembly




So, what’s so hard about it?
1) Repeats





     Repeat copy 1                                    Repeat copy 2




                                         Repeats break up contigs


Collapsed repeat
   consensus
                     http://www.cbcb.umd.edu/research/assembly_primer.shtml
2) Heterozygosity



                                                               Differences between sister (homologous) chromosomes




http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
2) Heterozygosity




             Polymorphic contig 2

Contig 1                            Contig 4
             Polymorphic contig 3
2) Heterozygosity




http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg
and many other sites
3) Many programs to choose from




Zhang et al. (2011) doi:10.1371/journal.pone.0017915.g001
Assembly: challenges
         Repeat copy 1                               Repeat copy 2




                         Knowing how to use the programs



Heterozygosity
                              Polymorphic contig 2

          Contig 1                                            Contig 4
                              Polymorphic contig 3
So, why does everybody want longer reads?




http://www.autobizz.com.my/forum/forum/General-Chat/944-The-worlds-longest-car.html
Longer reads?
Repeat copy 1                                 Repeat copy 2




    Long reads can span repeats and heterozygous regions




                       Polymorphic contig 2

 Contig 1                                              Contig 4
                       Polymorphic contig 3
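
What "spanning" means here can be made concrete with a small check. This is an illustrative sketch only: the function name `spans`, the coordinates and the 50 bp flank requirement are assumptions made for the example.

```python
def spans(read_start, read_end, region_start, region_end, min_flank=50):
    """True if the read covers the whole region plus a unique flank on each side.

    The flanking sequence is what anchors the read to the unique
    sequence on both sides of a repeat or heterozygous region.
    """
    return (read_start <= region_start - min_flank
            and read_end >= region_end + min_flank)


# Hypothetical 2 kbp repeat at positions 10000-12000.
repeat = (10000, 12000)
print(spans(10500, 10600, *repeat))  # 100 bp short read inside the repeat
print(spans(9000, 13500, *repeat))   # 4.5 kbp long read across the repeat
```

A read that falls entirely inside the repeat cannot tell the copies apart, while a read that reaches unique sequence on both sides can.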
PacBio to the rescue?
High-throughput sequencing

                           Library preparation

SMRTbell template

Standard Sequencing
  Large insert sizes
  Generates one pass on each molecule sequenced
  Single pass

Continued generations of reads

Circular Consensus Sequencing
  Small insert sizes
  Generates multiple passes on each molecule sequenced
  Multiple passes
High-throughput sequencing

      Raw read length
High-throughput sequencing

                           Raw reads and subreads

Standard Sequencing (large insert sizes)
  Single pass: generates one pass on each molecule sequenced

                                           ‘Subreads’

Circular Consensus Sequencing (small insert sizes)
  Multiple passes: generates multiple passes on each molecule sequenced
PacBio: uses

                           Long reads, low quality

Standard Sequencing (large insert sizes)
  Single pass: generates one pass on each molecule sequenced
  85-87% accuracy

Circular Consensus Sequencing (small insert sizes)
  Generates multiple passes on each molecule sequenced

                             Useful for assembly?
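
Why multiple passes over the same molecule help can be illustrated with a toy calculation. This is a deliberately crude model and not how PacBio computes its consensus: it assumes every pass calls a base correctly with the same independent probability (0.86 is used as a value inside the 85-87% range above) and that the consensus is right whenever more than half of the passes are right. The function name `majority_correct` is made up for this sketch.

```python
from math import comb


def majority_correct(p, passes):
    """Probability that more than half of `passes` independent calls are correct."""
    needed = passes // 2 + 1
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(needed, passes + 1)
    )


for n in (1, 3, 5, 7):
    print(n, "passes:", round(majority_correct(0.86, n), 4))
```

Under these assumptions the per-base accuracy rises quickly with the number of passes, but more passes are only possible with small inserts, which is the trade-off between the two sequencing modes.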
Solutions for assembly
Solutions for assembly (1)




   Designed by Pacific Biosciences




http://www.clker.com/clipart-4245.html
Solutions for assembly (2)
   Broad Institute




Need a special recipe
  for sequencing
Solutions for assembly (3)

                 PacBioToCA
        Error correct with short reads




Celera assembler


   http://schatzlab.cshl.edu/presentations/2012-01-17.PAG.SMRTassembly.pdf
PacBioToCA




             Koren et al., 2012
Shameless self-promotion

flxlexblog.wordpress.com
Shameless self-promotion




            @lexnederbragt
The Atlantic cod genome project
First draft




Fragmented assembly
    - short contigs
    - many gap bases
                                http://en.wikipedia.org
First draft



6467 scaffolds




                   35% gap bases
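
A figure such as the percentage of gap bases can be computed directly from the scaffold sequences. The sketch below assumes the common convention that gaps inside scaffolds are written as runs of `N`; the function name `gap_fraction` and the toy sequences are made up for the example.

```python
def gap_fraction(scaffolds):
    """Fraction of all scaffold bases that are gap bases (N or n)."""
    total = sum(len(seq) for seq in scaffolds)
    if total == 0:
        return 0.0
    gaps = sum(seq.upper().count("N") for seq in scaffolds)
    return gaps / total


# Two toy scaffolds: 6 gap bases out of 20.
toy = ["ACGTNNNNAC", "NNACGTACGT"]
print(f"{gap_fraction(toy):.0%} gap bases")
```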
The causes




Short Tandem Repeats (>20% of gaps)
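
Short tandem repeats next to gaps can be looked for with a regular expression. This is a minimal sketch rather than a replacement for a dedicated repeat finder: it assumes repeat units of 1 to 6 bases, at least three exact copies, and uppercase sequence. The function name `find_strs` is hypothetical.

```python
import re

# A unit of 1-6 bases (shortest unit preferred), followed by at least
# two more exact copies of itself.
STR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{2,}")


def find_strs(sequence):
    """Return (start, unit, copies) for each exact short tandem repeat."""
    hits = []
    for match in STR_PATTERN.finditer(sequence):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


print(find_strs("TTGACACACACGGA"))
```

Exact matching like this misses repeats with a sequencing error or a mutation in one of the copies, so it gives a lower bound.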
The causes


           Heterozygosity?



            Polymorphic contig 2

Contig 1                           Contig 4
            Polymorphic contig 3
The goal



 23 pseudochromosomes




       Longer contigs




                        Below 5% gap bases



PacBio to the rescue?
The approach

         Libraries

Aim for looooong insert sizes
The approach

                                  Sequencing

    Sequence with 90 minute movies

10 x coverage in reads of at least 3000 bp

                No, we don’t throw this away…
The approach

Error-correction
PacBio results
Relative throughput at different minimum length cutoffs

[Plot: percentage of total sequence (0-100), i.e. the fraction of bases at minimum length, against the length cutoff for the longest subread (0 to 15 kbp), for three libraries: 10kb lib 1, 10kb lib 2 and 4kb lib]

                                               Large library insert size important!
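
The quantity plotted above, the share of all sequenced bases that sits in subreads of at least a given length, is easy to compute from a list of subread lengths. A minimal sketch; the function name `fraction_of_bases_at_cutoff` and the toy lengths are assumptions, not the data behind the plot.

```python
def fraction_of_bases_at_cutoff(lengths, cutoff):
    """Fraction of all bases found in reads of at least `cutoff` bases."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(n for n in lengths if n >= cutoff) / total


# Toy subread lengths in bp.
toy_lengths = [1000, 2000, 3000, 4000]
for cutoff in (0, 3000, 5000):
    share = fraction_of_bases_at_cutoff(toy_lengths, cutoff)
    print(f"{cutoff:>5} bp cutoff: {share:.0%} of bases")
```

Counting bases rather than reads matters here: a few very long subreads can hold a large share of the data.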
PacBio results

64 SMRT Cells

3.2 Gbp in raw reads of at least 3 kb
3.8 x coverage

2.2 Gbp in longest subreads
Largest 15 kbp
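
The coverage number on this slide is the amount of sequence divided by the genome size. The sketch below works backwards from the slide's own numbers, assuming the 3.2 and 2.2 figures count bases (3.2 and 2.2 billion) and that the 3.8 x coverage refers to the 3.2 figure; the function names are made up for the example.

```python
def coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


def implied_genome_size(total_bases, fold_coverage):
    """Genome size that makes `total_bases` equal `fold_coverage` x."""
    return total_bases / fold_coverage


raw_bases = 3.2e9      # raw reads of at least 3 kb (from the slide)
subread_bases = 2.2e9  # longest subreads (from the slide)

genome = implied_genome_size(raw_bases, 3.8)
print(f"implied genome size: {genome / 1e6:.0f} Mbp")
print(f"coverage in longest subreads: {coverage(subread_bases, genome):.1f} x")
```

Both numbers are well below the 10 x target stated earlier in the talk for reads of at least 3000 bp.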
PacBio results

Mapping to the cod genome
      11.4 kbp subread




       10.6 kbp subread




      10.9 kbp subread
Example 1


ACACAC repeat




232 bp Gap




TGTGTG repeat
Example 1
Scaffold               ...ACACAC     TGTGTG...

PacBio reads
               Unplaced contig
Example 2


TGTGTG repeat




     344 bp Gap
Example 2

Scaffold       ...TGTGTG

PacBio reads

                     Heterozygosity?
Example 3

Scaffold


   PacBio reads
                  300 bp misassembly?
Error-correction




                          Work In Progress
http://openclipart.org/
Outlook




  Will PacBio solve our problems?
Outlook




  Or
Outlook



                Polymorphic contig 2

Contig 1                               Contig 4
                Polymorphic contig 3




   Will we find the heterozygous regions?
Outlook




 http://www.pasteur.fr/recherche/unites/Bbi/
 en.wikipedia.org
 and Martin Malmstrøm
Combining PacBio with short read technology for improved de novo genome assembly

  • 1. The best of both worlds Combining PacBio with short read technology for improved de novo genome assembly Lex Nederbragt, NSC and CEES lex.nederbragt@bio.uio.no
  • 3. Why does everybody want longer reads? … for genome assemblies
  • 4. What is a genome assembly Hierarchical structure reads contigs scaffolds
  • 5. Sequence data Reads reads contigs scaffolds original DNA fragments original DNA fragments Sequenced ends http://www.cbcb.umd.edu/research/assembly_primer.shtml
  • 6. Contigs Building contigs reads contigs scaffolds ACGCGATTCAGGTTACCACG GCGATTCAGGTTACCACGCG GATTCAGGTTACCACGCGTA TTCAGGTTACCACGCGTAGC CAGGTTACCACGCGTAGCGC Aligned reads GGTTACCACGCGTAGCGCAT TTACCACGCGTAGCGCATTA ACCACGCGTAGCGCATTACA CACGCGTAGCGCATTACACA CGCGTAGCGCATTACACAGA CGTAGCGCATTACACAGATT TAGCGCATTACACAGATTAG Consensus contig ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
  • 7. Contigs Building contigs reads contigs scaffolds Repeat copy 1 Repeat copy 2 Contig orientation? Contig order? Collapsed repeat consensus http://www.cbcb.umd.edu/research/assembly_primer.shtml
  • 8. Mate pairs Other read type reads contigs scaffolds Repeat copy 1 Repeat copy 2 (much) longer fragments mate pair reads
  • 9. Scaffolds Ordered, oriented contigs reads contigs scaffolds mate pairs contigs gap size estimate
  • 10. What is a genome assembly Hierarchical structure reads ACGCGATTCAGGTTACCACG GCGATTCAGGTTACCACGCG GATTCAGGTTACCACGCGTA TTCAGGTTACCACGCGTAGC CAGGTTACCACGCGTAGCGC Aligned reads GGTTACCACGCGTAGCGCAT TTACCACGCGTAGCGCATTA ACCACGCGTAGCGCATTACA CACGCGTAGCGCATTACACA CGCGTAGCGCATTACACAGA contigs CGTAGCGCATTACACAGATT TAGCGCATTACACAGATTAG Consensus contig ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG scaffolds
  • 11. Genome assembly So, what’s so hard about it?
  • 12. 1) Repeats reads contigs scaffolds Repeat copy 1 Repeat copy 2 Repeats break up contigs Collapsed repeat consensus http://www.cbcb.umd.edu/research/assembly_primer.shtml
  • 13. 2) Heterozygosity Differences between sister chromosomes http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
  • 14. 2) Heterozygosity Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3
  • 16. 3) Many programs to choose from Zhang et al. (2011) doi:10.1371/journal.pone.0017915.g001
  • 17. Assembly: challenges Repeat copy 1 Repeat copy 2 Knowing how to use the programs Heterozygosity Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3
  • 18. So, why does everybody want longer reads? http://www.autobizz.com.my/forum/forum/General-Chat/944-The-worlds-longest-car.html
  • 19. Longer reads? Repeat copy 1 Repeat copy 2 Long reads can span repeats and heterozygous regions Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3
  • 20. PacBio to the rescue?
  • 21. High-throughput sequencing Library preparation SMRTbell template. Standard Sequencing: large insert sizes, single pass; generates one pass on each molecule sequenced. Continued generations of reads. Circular Consensus Sequencing: small insert sizes, multiple passes; generates multiple passes on each molecule sequenced.
  • 22. High-throughput sequencing Raw read length
  • 23. High-throughput sequencing Raw reads and subreads SMRTbell template. Standard Sequencing: large insert sizes, single pass; generates one pass on each molecule sequenced. ‘Subreads’. Circular Consensus Sequencing: small insert sizes, multiple passes; generates multiple passes on each molecule sequenced.
  • 24. PacBio: uses Long reads → low quality. 85-87% accuracy. Useful for assembly? [Figure: SMRTbell template. Standard Sequencing: large insert sizes, single pass; generates one pass on each molecule sequenced. Circular Consensus Sequencing: small insert sizes; generates multiple passes on each molecule sequenced.]
  • 26. Solutions for assembly (1) Designed by Pacific Biosciences http://www.clker.com/clipart-4245.html
  • 27. Solutions for assembly (2) Broad Institute Need a special recipe for sequencing
  • 28. Solutions for assembly (3) PacBioToCA Error correct with short reads Celera assembler http://schatzlab.cshl.edu/presentations/2012-01-17.PAG.SMRTassembly.pdf
  • 29. PacBioToCA Koren et al, 2012
  • 31. Shameless self-promotion @lexnederbragt
  • 32. The Atlantic cod genome project
  • 33. First draft Fragmented assembly - short contigs - many gap bases http://en.wikipedia.org
  • 34. First draft 6467 scaffolds 35% gap bases
  • 35. The causes Short Tandem Repeats (>20% of gaps)
  • 36. The causes Heterozygosity? Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3
  • 37. The goal 23 pseudochromosomes Longer contigs Below 5% gap bases PacBio to the rescue?
  • 38. The approach Libraries Aim for looooong insert sizes. [Figure: SMRTbell template. Standard Sequencing: large insert sizes; generates one pass on each molecule sequenced. Circular Consensus Sequencing: small insert sizes; generates multiple passes on each molecule sequenced.]
  • 39. The approach Sequencing Sequence with 90 minute movies. 10 x coverage in reads of at least 3000 bp. No, we don’t throw this away… [Figure: SMRTbell template. Standard Sequencing: large insert sizes, single pass; generates one pass on each molecule sequenced. Circular Consensus Sequencing: small insert sizes; generates multiple passes on each molecule sequenced.]
  • 41. PacBio results Relative throughput at different minimum length cutoffs. [Plot: fraction of bases at minimum length. Y-axis: percentage of total sequence (0 to 100). X-axis: length cutoff, longest subread (0kbp, 3kbp, 5kbp, 10kbp, 15kbp). Series: 10kb lib 1, 10kb lib 2, 4kb lib.] Large library insert size important!
  • 42. PacBio results 64 SMRT Cells. 3.2 Gigabytes in raw reads at least 3kb, 3.8 x coverage. 2.2 Gigabytes in longest subreads. Largest 15 kbp. [Figure: SMRTbell template, Standard Sequencing and Circular Consensus Sequencing.]
  • 43. PacBio results Mapping to the cod genome 11.4 kbp subread 10.6 kbp subread 10.9 kbp subread
  • 44. Example 1 ACACAC repeat 232 bp Gap TGTGTG repeat
  • 47. Example 1 Scaffold ...ACACAC TGTGTG... PacBio reads Unplaced contig
  • 50. Example 2 Scaffold ...TGTGTG PacBio reads Heterozygosity?
  • 51. Example 3 Scaffold PacBio reads 300 bp misassembly?
  • 52. Error-correction Work In Progress http://openclipart.org/
  • 53. Outlook Will PacBio solve our problems?
  • 55. Outlook Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3 Will we find the heterozygous regions?
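
Slide 6 shows overlapping reads collapsing into one consensus contig. The following is a minimal sketch of that idea, not the method of any assembler named in the talk. It assumes the reads are error-free and already listed in left-to-right order, which a real assembler has to work out for itself. The function names and the min_overlap parameter are illustrative assumptions.

```python
# Sketch only: merge reads that are ALREADY in left-to-right order by their
# longest exact suffix/prefix overlap. Assumes error-free reads (hypothetical
# simplification; real reads contain errors and arrive unordered).

def overlap_length(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_k = min(len(left), len(right))
    for k in range(max_k, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_ordered_reads(reads, min_overlap=10):
    """Extend a contig read by read, appending only the non-overlapping tail."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap_length(contig, read, min_overlap)
        if k == 0:
            raise ValueError("read does not overlap the growing contig: " + read)
        contig += read[k:]
    return contig


# The twelve aligned reads and the consensus contig from slide 6.
SLIDE_6_READS = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]
SLIDE_6_CONSENSUS = "ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG"

if __name__ == "__main__":
    print(merge_ordered_reads(SLIDE_6_READS))
```

With exact matching, a repeat longer than the read length would give ambiguous overlaps. That is the situation slides 7 and 12 describe, and one reason longer reads are attractive.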
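
The plot on slide 41 reports, for each minimum length cutoff, the percentage of total sequence found in subreads at least that long. A sketch of how such a curve could be computed is below. The read lengths are made-up placeholders, not data from the talk, and the function name is an assumption.

```python
# Sketch only: percentage of all sequenced bases that sit in reads of at
# least `cutoff` bp. The example lengths are hypothetical placeholders.

def percent_bases_at_cutoff(read_lengths, cutoff):
    """Return 100 * (bases in reads with length >= cutoff) / (all bases)."""
    total = sum(read_lengths)
    if total == 0:
        raise ValueError("no bases in input")
    kept = sum(n for n in read_lengths if n >= cutoff)
    return 100.0 * kept / total


if __name__ == "__main__":
    example_lengths = [1000, 3000, 6000]  # hypothetical subread lengths in bp
    for cutoff in (0, 3000, 5000, 10000, 15000):  # cutoffs on the slide's x-axis
        print(cutoff, percent_bases_at_cutoff(example_lengths, cutoff))
```

Running this per library (for example the 4kb and 10kb libraries on the slide) would give one curve per library. A library with larger inserts keeps a higher percentage at long cutoffs.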