The best of both worlds
Combining PacBio with short read technology
  for improved de novo genome assembly

          Lex Nederbragt, NSC and CEES
           lex.nederbragt@bio.uio.no
This talk
Why does everybody want longer reads?


        … for genome assemblies
What is a genome assembly


    Hierarchical structure

reads

 contigs

   scaffolds
Sequence data

                           Reads
                                                    reads

                                                      contigs

                                                        scaffolds



original DNA

 fragments





                  Sequenced ends




               http://www.cbcb.umd.edu/research/assembly_primer.shtml
Contigs

                          Building contigs
                                                               reads

                                                                 contigs

                                                                   scaffolds


                 ACGCGATTCAGGTTACCACG
                   GCGATTCAGGTTACCACGCG
                     GATTCAGGTTACCACGCGTA
                       TTCAGGTTACCACGCGTAGC
                         CAGGTTACCACGCGTAGCGC
  Aligned reads            GGTTACCACGCGTAGCGCAT
                             TTACCACGCGTAGCGCATTA
                               ACCACGCGTAGCGCATTACA
                                 CACGCGTAGCGCATTACACA
                                   CGCGTAGCGCATTACACAGA
                                     CGTAGCGCATTACACAGATT
                                       TAGCGCATTACACAGATTAG
Consensus contig ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
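
As a rough illustration of how the aligned reads above collapse into the consensus contig, here is a minimal Python sketch. It assumes error-free reads that are already ordered left to right, and it simply merges each read onto the growing contig by its longest suffix-prefix overlap. Real assemblers build overlap or de Bruijn graphs and have to cope with sequencing errors, so this is not how any particular program works. The function names and the `min_overlap` parameter are made up for this example.

```python
def longest_overlap(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(reads, min_overlap=3):
    """Merge reads (assumed ordered left to right and error-free) into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        k = longest_overlap(contig, read, min_overlap)
        if k == 0:
            raise ValueError("read does not overlap the contig: " + read)
        contig += read[k:]  # append only the part that extends the contig
    return contig


# The twelve reads shown on the slide.
reads = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]
contig = merge_reads(reads)
print(contig)
```

Each read here overlaps its neighbour by 18 of its 20 bases, which is why the merge is unambiguous. Repeats are exactly the case where this breaks down.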
Contigs

                          Building contigs
                                                                     reads

                                                                       contigs

                                                                         scaffolds




     Repeat copy 1                                    Repeat copy 2




                                                          Contig orientation?
                                                            Contig order?




Collapsed repeat
   consensus
                     http://www.cbcb.umd.edu/research/assembly_primer.shtml
Mate pairs

                          Other read type
                                                      reads

                                                        contigs

                                                          scaffolds




     Repeat copy 1                      Repeat copy 2




(much) longer fragments
                                            mate pair reads
Scaffolds

                 Ordered, oriented contigs
                                               reads

                                                 contigs

                                                   scaffolds




    mate pairs
contigs



                           gap size estimate
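
The gap size estimate can be sketched with simple arithmetic. This is a minimal Python sketch, assuming the mate-pair library has a known insert size and that each link is described by two hypothetical distances: from the outer end of the first read to the end of its contig, and from the start of the next contig to the outer end of the second read. Real scaffolders also model the insert size distribution; taking the median over links is just one simple choice.

```python
from statistics import median


def gap_estimate(insert_size, dist_to_end_a, dist_from_start_b):
    """Gap implied by one mate pair: the part of the insert not covered by either contig."""
    return insert_size - dist_to_end_a - dist_from_start_b


def scaffold_gap(insert_size, links):
    """Median gap estimate over all mate pairs linking the same two contigs."""
    return median(gap_estimate(insert_size, a, b) for a, b in links)


# Three hypothetical mate pairs from a 3000 bp library linking contig A to contig B.
links = [(1200, 1500), (900, 1850), (2000, 650)]
print(scaffold_gap(3000, links))
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap.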
What is a genome assembly


    Hierarchical structure

reads
  Aligned reads
    ACGCGATTCAGGTTACCACG
      GCGATTCAGGTTACCACGCG
        GATTCAGGTTACCACGCGTA
          TTCAGGTTACCACGCGTAGC
            CAGGTTACCACGCGTAGCGC
              GGTTACCACGCGTAGCGCAT
                TTACCACGCGTAGCGCATTA
                  ACCACGCGTAGCGCATTACA
                    CACGCGTAGCGCATTACACA
                      CGCGTAGCGCATTACACAGA
                        CGTAGCGCATTACACAGATT
                          TAGCGCATTACACAGATTAG

 contigs
  Consensus contig
    ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG




   scaffolds
Genome assembly




So, what’s so hard about it?
1) Repeats

                                                                     reads

                                                                       contigs

                                                                         scaffolds




     Repeat copy 1                                    Repeat copy 2




                                         Repeats break up contigs


Collapsed repeat
   consensus
                     http://www.cbcb.umd.edu/research/assembly_primer.shtml
2) Heterozygosity



                                                               Differences
                                                              between sister
                                                          *   chromosomes



                                                          *




                                                          *




http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
2) Heterozygosity




             Polymorphic contig 2

Contig 1                            Contig 4
             Polymorphic contig 3
2) Heterozygosity




http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg
and many other sites
3) Many programs to choose from




Zhang et al. (2011) doi:10.1371/journal.pone.0017915.g001
Assembly: challenges
         Repeat copy 1                               Repeat copy 2




                         Knowing how to use the programs



Heterozygosity
                              Polymorphic contig 2

          Contig 1                                            Contig 4
                              Polymorphic contig 3
So, why does everybody want longer reads?




http://www.autobizz.com.my/forum/forum/General-Chat/944-The-worlds-longest-car.html
Longer reads?
Repeat copy 1                                 Repeat copy 2




    Long reads can span repeats and heterozygous regions




                       Polymorphic contig 2

 Contig 1                                              Contig 4
                       Polymorphic contig 3
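
What "spanning a repeat" means can be made concrete with a small check. This is a minimal Python sketch, assuming a read only resolves a repeat if it covers the whole repeat copy plus some unique sequence on both sides. The `anchor` parameter (unique bases required on each side) and its default of 50 are assumptions for illustration, not values from the talk.

```python
def spans_repeat(read_start, read_length, repeat_start, repeat_length, anchor=50):
    """True if the read covers the whole repeat plus `anchor` unique bases on each side."""
    read_end = read_start + read_length
    repeat_end = repeat_start + repeat_length
    return read_start <= repeat_start - anchor and read_end >= repeat_end + anchor


def minimum_spanning_read_length(repeat_length, anchor=50):
    """Shortest read that could possibly span a repeat of the given length."""
    return repeat_length + 2 * anchor


# A hypothetical 2000 bp repeat copy starting at position 10000.
print(spans_repeat(10500, 100, 10000, 2000))   # short read inside the repeat
print(spans_repeat(8500, 5000, 10000, 2000))   # long read covering both flanks
print(minimum_spanning_read_length(2000))
```

The same reasoning applies to a heterozygous region: a read that starts in Contig 1 and ends in Contig 4 shows which polymorphic contig lies between them on that haplotype.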
PacBio to the rescue?
High-throughput sequencing

                           Library preparation

SMRTbell template

Standard sequencing
  Large insert sizes
  Generates one pass on each molecule sequenced (single pass)

Circular consensus sequencing
  Small insert sizes
  Generates multiple passes on each molecule sequenced
  Continued generations of reads
High-throughput sequencing

      Raw read length
High-throughput sequencing

                           Raw reads and subreads

SMRTbell template

Standard sequencing
  Large insert sizes
  Generates one pass on each molecule sequenced (single pass)

‘Subreads’

Circular consensus sequencing
  Small insert sizes
  Generates multiple passes on each molecule sequenced
PacBio: uses

                           Long reads, low quality

SMRTbell template

Standard sequencing
  Large insert sizes
  Generates one pass on each molecule sequenced (single pass)
  85-87% accuracy

Circular consensus sequencing
  Useful for assembly?
  Small insert sizes
  Generates multiple passes on each molecule sequenced
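
To get a feel for what 85-87% single-pass accuracy means for assembly, here is a back-of-the-envelope Python sketch. It assumes errors are independent and evenly spread along the read, which is a simplification and not a claim about the real PacBio error profile. The function names are made up for this example.

```python
def expected_errors(read_length, accuracy):
    """Expected number of erroneous bases in a read of the given length."""
    return read_length * (1.0 - accuracy)


def error_free_kmer_probability(k, accuracy):
    """Chance that k consecutive bases are all correct, assuming independent errors."""
    return accuracy ** k


# A 3000 bp read at 85% accuracy, and a 20-mer taken from it.
print(expected_errors(3000, 0.85))
print(error_free_kmer_probability(20, 0.85))
# For comparison: the same 20-mer at 99% accuracy.
print(error_free_kmer_probability(20, 0.99))
```

Under these assumptions only a few percent of 20-mers from a raw read are error-free, which is one way to see why raw long reads need error correction before most assemblers can use them.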
Solutions for assembly
Solutions for assembly (1)




   Designed by Pacific Biosciences




http://www.clker.com/clipart-4245.html
Solutions for assembly (2)
   Broad Institute




Need a special recipe
  for sequencing
Solutions for assembly (3)

                 PacBioToCA
        Error correct with short reads




Celera assembler


   http://schatzlab.cshl.edu/presentations/2012-01-17.PAG.SMRTassembly.pdf
PacBioToCA




             Koren et al, 2012
Shameless self-promotion

flxlexblog.wordpress.com
Shameless self-promotion




            @lexnederbragt
The Atlantic cod genome project
First draft




Fragmented assembly
    - short contigs
    - many gap bases
                                http://en.wikipedia.org
First draft



6467 scaffolds




                   35% gap bases
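
The gap base percentage can be computed directly from the scaffold sequences. This is a minimal Python sketch, assuming gaps are written as runs of `N` in the scaffolds, which is the usual convention but is not stated on the slide. The toy sequences are invented.

```python
def gap_fraction(scaffolds):
    """Fraction of all scaffold bases that are gap bases (N)."""
    total = sum(len(seq) for seq in scaffolds)
    gaps = sum(seq.upper().count("N") for seq in scaffolds)
    return gaps / total if total else 0.0


# Two toy scaffolds: 4 gap bases out of 20 bases in total.
toy_scaffolds = ["ACGTNNNNACGT", "ACGTACGT"]
print(gap_fraction(toy_scaffolds))
```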
The causes




Short Tandem Repeats (>20% of gaps)
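
Short tandem repeats at gap edges can be spotted with a simple pattern search. This is a minimal Python sketch using a regular expression with a backreference. The unit length, the minimum copy number and the test sequence are assumptions chosen for illustration, not the criteria used in the cod project.

```python
import re


def find_tandem_repeats(sequence, unit_length=2, min_copies=3):
    """Return (start, end, unit, copies) for each run of a repeated unit."""
    # One unit, followed by at least (min_copies - 1) more copies of the same unit.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_length, min_copies - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // unit_length
        hits.append((match.start(), match.end(), unit, copies))
    return hits


# A made-up sequence containing an AC dinucleotide repeat.
print(find_tandem_repeats("TTGACACACACACGGA"))
```

Note that this simple pattern also reports homopolymer runs (for example `AAAAAA` as three copies of `AA`), so a real analysis would filter those out.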
The causes


           Heterozygosity?



            Polymorphic contig 2

Contig 1                           Contig 4
            Polymorphic contig 3
The goal



 23 pseudochromosomes




       Longer contigs




                        Below 5% gap bases



PacBio to the rescue?
The approach

         Libraries

Aim for looooong insert sizes

SMRTbell template

Standard sequencing
  Large insert sizes
  Generates one pass on each molecule sequenced

Circular consensus sequencing
  Small insert sizes
  Generates multiple passes on each molecule sequenced
The approach

                                  Sequencing

Sequence with 90 minute movies
10 x coverage in reads of at least 3000 bp

SMRTbell template

Standard sequencing
  Large insert sizes
  Generates one pass on each molecule sequenced (single pass)

Circular consensus sequencing
  Small insert sizes
  Generates multiple passes on each molecule sequenced

                No, we don’t throw this away…
The approach

Error-correction
PacBio results
Relative throughput at different minimum length cutoffs

[Plot: percentage of total sequence (fraction of bases at minimum length) versus length cutoff of the longest subread, from 0 kbp to 15 kbp, for three libraries: 10kb lib 1, 10kb lib 2 and 4kb lib]

                                               Large library insert size important!
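
The quantity on the y-axis of this plot can be computed as follows. This is a minimal Python sketch, assuming a list with the length of the longest subread per raw read. The function name and the toy lengths are invented for illustration.

```python
def fraction_of_bases_at_cutoff(read_lengths, cutoff):
    """Fraction of all bases that sit in reads of at least `cutoff` bases."""
    total = sum(read_lengths)
    if total == 0:
        return 0.0
    kept = sum(length for length in read_lengths if length >= cutoff)
    return kept / total


# Toy data: five subreads, 20000 bases in total.
lengths = [1000, 2000, 3000, 4000, 10000]
for cutoff in (0, 3000, 5000, 10000, 15000):
    print(cutoff, fraction_of_bases_at_cutoff(lengths, cutoff))
```

A library with larger inserts shifts more of its bases into long subreads, so its curve drops off later as the cutoff increases.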

                                        PacBio results

64 SMRT Cells
3.2 gigabases in raw reads of at least 3 kb
3.8 x coverage

2.2 gigabases in longest subreads
Largest 15 kbp

SMRTbell template

Standard sequencing
  Large insert sizes
  Generates one pass on each molecule sequenced

Circular consensus sequencing
  Small insert sizes
  Generates multiple passes on each molecule sequenced
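
The coverage figure on this slide follows from dividing the sequenced bases by the genome size. This is a minimal Python sketch, taking the 3.2 G of raw reads as 3.2 billion bases and working backwards from the stated 3.8 x coverage. The genome size it prints is therefore only implied by the slide's own numbers, not a measured value.

```python
def coverage(total_bases, genome_size):
    """Fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def implied_genome_size(total_bases, fold_coverage):
    """Genome size implied by a yield and the coverage it is said to give."""
    return total_bases / fold_coverage


size = implied_genome_size(3.2e9, 3.8)
print(round(size / 1e6), "Mbp implied genome size")
print(round(10 * size / 1e9, 1), "Gbp needed for 10 x coverage")
```

By this arithmetic the 64 SMRT Cells deliver a bit more than a third of the 10 x target in reads of at least 3000 bp.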
PacBio results

Mapping to the cod genome
      11.4 kbp subread




       10.6 kbp subread




      10.9 kbp subread
Example 1


ACACAC repeat




232 bp Gap




TGTGTG repeat
Example 1
Scaffold               ...ACACAC     TGTGTG...

PacBio reads
               Unplaced contig
Example 2


TGTGTG repeat




     344 bp Gap
Example 2

Scaffold       ...TGTGTG

PacBio reads

                     Heterozygosity?
Example 3

Scaffold


   PacBio reads
                  300 bp misassembly?
Error-correction




                          Work In Progress
http://openclipart.org/
Outlook




  Will PacBio solve our problems?
Outlook




  Or
Outlook



                Polymorphic contig 2

Contig 1                               Contig 4
                Polymorphic contig 3




   Will we find the heterozygous regions?
Outlook




 http://www.pasteur.fr/recherche/unites/Bbi/
 en.wikipedia.org
 and Martin Malmstrøm

Contenu connexe

Tendances

protein-protein interaction
protein-protein  interactionprotein-protein  interaction
protein-protein interactionZeshan Haider
 
Construction of genomic library in lambda
Construction of genomic library in lambdaConstruction of genomic library in lambda
Construction of genomic library in lambdaArchana Shaw
 
Chou fasman algorithm for protein structure prediction
Chou fasman algorithm for protein structure predictionChou fasman algorithm for protein structure prediction
Chou fasman algorithm for protein structure predictionRoshan Karunarathna
 
Scoring schemes in bioinformatics (blosum)
Scoring schemes in bioinformatics (blosum)Scoring schemes in bioinformatics (blosum)
Scoring schemes in bioinformatics (blosum)SumatiHajela
 
Tandem affinity purification
Tandem affinity purificationTandem affinity purification
Tandem affinity purificationRamish Saher
 
NEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGNEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGAayushi Pal
 
Cre- lox recombination
Cre- lox recombinationCre- lox recombination
Cre- lox recombinationPranavKhudania
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...Torsten Seemann
 
Cytoscape basic features
Cytoscape basic featuresCytoscape basic features
Cytoscape basic featuresLuay AL-Assadi
 

Tendances (20)

Dot matrix seminar
Dot matrix seminarDot matrix seminar
Dot matrix seminar
 
Gene prediction strategies
Gene prediction strategies Gene prediction strategies
Gene prediction strategies
 
protein-protein interaction
protein-protein  interactionprotein-protein  interaction
protein-protein interaction
 
Construction of genomic library in lambda
Construction of genomic library in lambdaConstruction of genomic library in lambda
Construction of genomic library in lambda
 
Ramachandran plot
Ramachandran plotRamachandran plot
Ramachandran plot
 
Chou fasman algorithm for protein structure prediction
Chou fasman algorithm for protein structure predictionChou fasman algorithm for protein structure prediction
Chou fasman algorithm for protein structure prediction
 
Knock out mice
Knock out miceKnock out mice
Knock out mice
 
Transcriptomics
TranscriptomicsTranscriptomics
Transcriptomics
 
High throughput sequencing
High throughput sequencingHigh throughput sequencing
High throughput sequencing
 
Genome Mapping
Genome MappingGenome Mapping
Genome Mapping
 
Scoring schemes in bioinformatics (blosum)
Scoring schemes in bioinformatics (blosum)Scoring schemes in bioinformatics (blosum)
Scoring schemes in bioinformatics (blosum)
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Sequence database
Sequence databaseSequence database
Sequence database
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
 
Rna seq
Rna seqRna seq
Rna seq
 
Tandem affinity purification
Tandem affinity purificationTandem affinity purification
Tandem affinity purification
 
NEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCINGNEXT GENERATION SEQUENCING
NEXT GENERATION SEQUENCING
 
Cre- lox recombination
Cre- lox recombinationCre- lox recombination
Cre- lox recombination
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
 
Cytoscape basic features
Cytoscape basic featuresCytoscape basic features
Cytoscape basic features
 

En vedette

IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent DataIonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent DataAdrian Baez-Ortega
 
NGS technologies - platforms and applications
NGS technologies - platforms and applicationsNGS technologies - platforms and applications
NGS technologies - platforms and applicationsAGRF_Ltd
 
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015Torsten Seemann
 
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Lex Nederbragt
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryJan Aerts
 
2014 June 17 PacBio User Group Meeting Presentation "How Looking for a Needle...
2014 June 17 PacBio User Group Meeting Presentation "How Looking for a Needle...2014 June 17 PacBio User Group Meeting Presentation "How Looking for a Needle...
2014 June 17 PacBio User Group Meeting Presentation "How Looking for a Needle...Anne Deslattes Mays
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_coursehansjansen9999
 
Improving and validating the Atlantic Cod genome assembly using PacBio
Improving and validating the Atlantic Cod genome assembly using PacBioImproving and validating the Atlantic Cod genome assembly using PacBio
Improving and validating the Atlantic Cod genome assembly using PacBioLex Nederbragt
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015Torsten Seemann
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Keith Bradnam
 
De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015Torsten Seemann
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAGRF_Ltd
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotLi Shen
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...VHIR Vall d’Hebron Institut de Recerca
 
[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomicsMads Albertsen
 
Semiconductor Sequencing Applications for Plant Sciences
Semiconductor Sequencing Applications for Plant SciencesSemiconductor Sequencing Applications for Plant Sciences
Semiconductor Sequencing Applications for Plant SciencesThermo Fisher Scientific
 
Ngs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesNgs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesScott Edmunds
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...QIAGEN
 

En vedette (20)

IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent DataIonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
 
Jan2016 pac bio giab
Jan2016 pac bio giabJan2016 pac bio giab
Jan2016 pac bio giab
 
NGS technologies - platforms and applications
NGS technologies - platforms and applicationsNGS technologies - platforms and applications
NGS technologies - platforms and applications
 
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
 
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
 
Next-generation sequencing - variation discovery
Next-generation sequencing - variation discoveryNext-generation sequencing - variation discovery
Next-generation sequencing - variation discovery
 
2014 June 17 PacBio User Group Meeting Presentation "How Looking for a Needle...
2014 June 17 PacBio User Group Meeting Presentation "How Looking for a Needle...2014 June 17 PacBio User Group Meeting Presentation "How Looking for a Needle...
2014 June 17 PacBio User Group Meeting Presentation "How Looking for a Needle...
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_course
 
Improving and validating the Atlantic Cod genome assembly using PacBio
Improving and validating the Atlantic Cod genome assembly using PacBioImproving and validating the Atlantic Cod genome assembly using PacBio
Improving and validating the Atlantic Cod genome assembly using PacBio
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...
 
20140711 3 t_clark_ercc2.0_workshop
20140711 3 t_clark_ercc2.0_workshop20140711 3 t_clark_ercc2.0_workshop
20140711 3 t_clark_ercc2.0_workshop
 
De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics
 
Semiconductor Sequencing Applications for Plant Sciences
Semiconductor Sequencing Applications for Plant SciencesSemiconductor Sequencing Applications for Plant Sciences
Semiconductor Sequencing Applications for Plant Sciences
 
Ngs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesNgs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challenges
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 

Similaire à Combining PacBio with short read technology for improved de novo genome assembly

2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshopc.titus.brown
 
How to sequence a large eukaryotic genome
How to sequence a large eukaryotic genomeHow to sequence a large eukaryotic genome
How to sequence a large eukaryotic genomeLex Nederbragt
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishingNikolay Vyahhi
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pubsesejun
 
Genome Assembly copy
Genome Assembly   copyGenome Assembly   copy
Genome Assembly copyPradeep Kumar
 
The Genome Assembly Problem
The Genome Assembly ProblemThe Genome Assembly Problem
The Genome Assembly ProblemMark Chang
 

Similaire à Combining PacBio with short read technology for improved de novo genome assembly (8)

2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
 
How to sequence a large eukaryotic genome
How to sequence a large eukaryotic genomeHow to sequence a large eukaryotic genome
How to sequence a large eukaryotic genome
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
Genome Assembly copy
Genome Assembly   copyGenome Assembly   copy
Genome Assembly copy
 
The Genome Assembly Problem
The Genome Assembly ProblemThe Genome Assembly Problem
The Genome Assembly Problem
 
Git Going With DVCS v1.5.2
Git Going With DVCS v1.5.2Git Going With DVCS v1.5.2
Git Going With DVCS v1.5.2
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 

Combining PacBio with short read technology for improved de novo genome assembly

  • 1. The best of both worlds Combining PacBio with short read technology for improved de novo genome assembly Lex Nederbragt, NSC and CEES lex.nederbragt@bio.uio.no
  • 3. Why does everybody want longer reads? … for genome assemblies
  • 4. What is a genome assembly Hierarchical structure reads contigs scaffolds
  • 5. Sequence data Reads original DNA fragments Sequenced ends http://www.cbcb.umd.edu/research/assembly_primer.shtml
  • 6. Contigs Building contigs Aligned reads: ACGCGATTCAGGTTACCACG GCGATTCAGGTTACCACGCG GATTCAGGTTACCACGCGTA TTCAGGTTACCACGCGTAGC CAGGTTACCACGCGTAGCGC GGTTACCACGCGTAGCGCAT TTACCACGCGTAGCGCATTA ACCACGCGTAGCGCATTACA CACGCGTAGCGCATTACACA CGCGTAGCGCATTACACAGA CGTAGCGCATTACACAGATT TAGCGCATTACACAGATTAG Consensus contig: ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG
  • 7. Contigs Building contigs Repeat copy 1 Repeat copy 2 Contig orientation? Contig order? Collapsed repeat consensus http://www.cbcb.umd.edu/research/assembly_primer.shtml
  • 8. Mate pairs Other read type Repeat copy 1 Repeat copy 2 (much) longer fragments mate pair reads
  • 9. Scaffolds Ordered, oriented contigs mate pairs contigs gap size estimate
  • 10. What is a genome assembly Hierarchical structure reads (aligned reads): ACGCGATTCAGGTTACCACG GCGATTCAGGTTACCACGCG GATTCAGGTTACCACGCGTA TTCAGGTTACCACGCGTAGC CAGGTTACCACGCGTAGCGC GGTTACCACGCGTAGCGCAT TTACCACGCGTAGCGCATTA ACCACGCGTAGCGCATTACA CACGCGTAGCGCATTACACA CGCGTAGCGCATTACACAGA CGTAGCGCATTACACAGATT TAGCGCATTACACAGATTAG contigs (consensus contig): ACGCGATTCAGGTTACCACGCGTAGCGCATTACACAGATTAG scaffolds
  • 11. Genome assembly So, what’s so hard about it?
  • 12. 1) Repeats Repeat copy 1 Repeat copy 2 Repeats break up contigs Collapsed repeat consensus http://www.cbcb.umd.edu/research/assembly_primer.shtml
  • 13. 2) Heterozygosity Differences between sister chromosomes http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
  • 14. 2) Heterozygosity Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3
  • 16. 3) Many programs to choose from Zhang et al. (2011) doi:10.1371/journal.pone.0017915.g001
  • 17. Assembly: challenges Repeat copy 1 Repeat copy 2 Knowing how to use the programs Heterozygosity Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3
  • 18. So, why does everybody want longer reads? http://www.autobizz.com.my/forum/forum/General-Chat/944-The-worlds-longest-car.html
  • 19. Longer reads? Repeat copy 1 Repeat copy 2 Long reads can span repeats and heterozygous regions Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3
  • 20. PacBio to the rescue?
  • 21. High-throughput sequencing Library preparation SMRTbell template. Standard Sequencing: large insert sizes, generates one pass on each molecule sequenced (single pass). Circular Consensus Sequencing: small insert sizes, generates multiple passes on each molecule sequenced. Continued generations of reads
  • 22. High-throughput sequencing Raw read length
  • 23. High-throughput sequencing Raw reads and subreads SMRTbell template. Standard Sequencing: large insert sizes, generates one pass on each molecule sequenced (single pass). ‘Subreads’ Circular Consensus Sequencing: small insert sizes, generates multiple passes on each molecule sequenced.
  • 24. PacBio: uses Long reads, low quality 85-87% accuracy Useful for assembly? [SMRTbell template diagram: Standard Sequencing (large insert sizes, single pass) vs. Circular Consensus Sequencing (small insert sizes, multiple passes)]
  • 26. Solutions for assembly (1) Designed by Pacific Biosciences http://www.clker.com/clipart-4245.html
  • 27. Solutions for assembly (2) Broad Institute Need a special recipe for sequencing
  • 28. Solutions for assembly (3) PacBioToCA Error correct with short reads Celera assembler http://schatzlab.cshl.edu/presentations/2012-01-17.PAG.SMRTassembly.pdf
  • 29. PacBioToCA Koren et al, 2012
  • 31. Shameless self-promotion @lexnederbragt
  • 32. The Atlantic cod genome project
  • 33. First draft Fragmented assembly - short contigs - many gap bases http://en.wikipedia.org
  • 34. First draft 6467 scaffolds 35% gap bases
  • 35. The causes Short Tandem Repeats (>20% of gaps)
  • 36. The causes Heterozygosity? Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3
  • 37. The goal 23 pseudochromosomes Longer contigs Below 5% gap bases PacBio to the rescue?
  • 38. The approach Libraries Aim for looooong insert sizes [SMRTbell template diagram: Standard Sequencing vs. Circular Consensus Sequencing]
  • 39. The approach Sequencing Sequence with 90 minute movies 10 x coverage in reads of at least 3000 bp No, we don’t throw this away… [SMRTbell template diagram: Standard Sequencing vs. Circular Consensus Sequencing]
  • 41. PacBio results Large library insert size important! [Figure: Relative throughput at different minimum length cutoffs (fraction of bases at minimum length). x-axis: length cutoff, longest subread, 0 kbp to 15 kbp; y-axis: percentage of total sequence; series: 10kb lib 1, 10kb lib 2, 4kb lib]
  • 42. PacBio results 64 SMRT Cells 3.2 Gigabytes in raw reads at least 3kb 3.8 x coverage 2.2 Gigabytes in longest subreads Largest 15 kbp [SMRTbell template diagram: Standard Sequencing vs. Circular Consensus Sequencing]
  • 43. PacBio results Mapping to the cod genome 11.4 kbp subread 10.6 kbp subread 10.9 kbp subread
  • 44. Example 1 ACACAC repeat 232 bp Gap TGTGTG repeat
  • 47. Example 1 Scaffold ...ACACAC TGTGTG... PacBio reads Unplaced contig
  • 50. Example 2 Scaffold ...TGTGTG PacBio reads Heterozygosity?
  • 51. Example 3 Scaffold PacBio reads 300 bp misassembly?
  • 52. Error-correction Work In Progress http://openclipart.org/
  • 53. Outlook Will PacBio solve our problems?
  • 55. Outlook Polymorphic contig 2 Contig 1 Contig 4 Polymorphic contig 3 Will we find the heterozygous regions?
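
The contig-building step shown on slides 6 and 10 can be sketched in a few lines of Python. This is only a minimal illustration, not how any production assembler works: it assumes the reads are error-free, already sorted left to right, and that each read overlaps the growing contig exactly. The names `overlap_length`, `merge_reads` and the `min_overlap` parameter are invented for this sketch and do not come from the talk.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is also a
    prefix of `right`, or 0 if no overlap of at least `min_overlap` exists."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(reads: list[str], min_overlap: int = 3) -> str:
    """Extend a contig read by read, appending only the non-overlapping tail.

    Assumption (hypothetical simplification): reads are given in tiling order
    and contain no sequencing errors.
    """
    contig = reads[0]
    for read in reads[1:]:
        size = overlap_length(contig, read, min_overlap)
        if size == 0:
            raise ValueError(f"read {read!r} does not overlap the contig")
        contig += read[size:]
    return contig


# The twelve aligned reads from slide 6, in the order shown there.
READS = [
    "ACGCGATTCAGGTTACCACG",
    "GCGATTCAGGTTACCACGCG",
    "GATTCAGGTTACCACGCGTA",
    "TTCAGGTTACCACGCGTAGC",
    "CAGGTTACCACGCGTAGCGC",
    "GGTTACCACGCGTAGCGCAT",
    "TTACCACGCGTAGCGCATTA",
    "ACCACGCGTAGCGCATTACA",
    "CACGCGTAGCGCATTACACA",
    "CGCGTAGCGCATTACACAGA",
    "CGTAGCGCATTACACAGATT",
    "TAGCGCATTACACAGATTAG",
]

CONTIG = merge_reads(READS)
print(CONTIG)
```

Run on the slide's reads, this prints the same 42 bp consensus contig given on slide 6. With repeats, heterozygosity, or read errors (the problems listed on slides 12 to 17), exact suffix-prefix matching like this is no longer enough, which is where the repeat-spanning long reads discussed in the talk come in.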