Metagenome assembly – part II
C. Titus Brown
ctb@msu.edu
Warnings

This talk contains forward looking statements. These forward-
   looking statements can be identified by terminology such as
                  “will”, “expects”, and “believes”.

                                      -- Safe Harbor provisions of the

                                  U.S. Private Securities Litigation Act



         “Making predictions is difficult, especially if

                  they’re about the future.”

                                            -- Attributed to Niels Bohr
The computational conundrum



               More data => better.


and


  More data => computationally more challenging.
Reads vs edges (memory) in de Bruijn graphs




           Conway T C , Bromage A J Bioinformatics 2011;27:479-486


© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
 please email: journals.permissions@oup.com
2. Big data sets require big machines
 For even relatively small data sets, metagenomic assemblers scale
     poorly.

 Memory usage ~ “real” variation + number of errors

 Number of errors ~ size of data set

 Size of data set == big!!



 (Estimated 6 weeks x 3 TB of RAM to do 300gb soil sample, with a
      slightly modified conventional assembler.)
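The scaling argument above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a measurement from the talk: the genome is random, errors are uniform substitutions, and the genome size, read length, error rate, and k are all made-up illustration values. It counts distinct k-mers (a rough proxy for de Bruijn graph memory) as coverage grows.

```python
import random

def distinct_kmers(reads, k):
    """Return the set of distinct k-mers seen across all reads."""
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            seen.add(read[i:i + k])
    return seen

def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads uniformly from the genome, substituting bases at error_rate."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads

# Hypothetical toy parameters, chosen only to keep the example fast.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
true_kmers = len(distinct_kmers([genome], k=20))

for coverage in (10, 20, 40):
    n_reads = coverage * len(genome) // 100
    reads = simulate_reads(genome, n_reads, 100, 0.01, rng)
    total = len(distinct_kmers(reads, k=20))
    print(f"{coverage}x coverage: {total} distinct k-mers ({true_kmers} in the genome)")
```

The "real" k-mer count is capped by the genome, while the erroneous k-mers keep accumulating as more reads are added, which is the pattern the slide describes.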
Soil is full of uncultured microbes




                          Randy Jackson
Great Prairie sampling design
[Figure: sampling locations arranged around a reference core, at spacings of 1 cm, 1 m, and 10 m.]

Soil cores: 1 inch diameter, 4 inches deep (litter and roots removed)
 • Spatial samples: 16S rRNA, nifH
 • Reference sample sequenced (small unmixed sample)
   Reference bulk soil: stored for additional “omics” and metadata
Soil contains thousands to millions of species
(“Collector’s curves” of ~species)

[Figure: collector’s curves. x-axis: Number of Sequences (100 to 8,100); y-axis: Number of OTUs (0 to 2,000). Series: Iowa Corn, Iowa Native Prairie, Kansas Corn, Kansas Native Prairie, Wisconsin Corn, Wisconsin Native Prairie, Wisconsin Restored Prairie, Wisconsin Switchgrass.]
The set of questions for soil -- discovery

       What’s there?

       Is it really that complex a community?

       How “deep” do we need to sequence to sample thoroughly and
        systematically?

       What organisms and gene functions are present, including non-
        canonical carbon and nitrogen cycling pathways?

       What kind of organismal and functional overlap is there between
        different sites? (Total sampling needed?)

       How is ecological complexity created & maintained?

       How does ecological complexity respond to perturbation?
Why are we applying short-read
          sequencing to this problem!?
   Short-read sampling is deep and quantitative.
     Statistical argument: your ability to observe rare
        organisms – your sensitivity of measurement – is directly
        related to the number of independent sequences you
        take.
     Longer reads (PacBio, 454, Ion Torrent) are less
        informative.

   Majority of metagenome studies going forward will make use
    of Illumina.

   BUT this kind of sequence is challenging to analyze.

   BUT, BUT this kind of sequence is necessary for high
    complexity environments.
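The statistical argument above can be made concrete with a short calculation. This is a minimal sketch that assumes each read is an independent draw and that an organism's share of reads equals its relative abundance; the function names and the example abundance are hypothetical.

```python
import math

def detection_probability(abundance, n_reads):
    """P(at least one read comes from an organism at this relative abundance)."""
    return 1.0 - (1.0 - abundance) ** n_reads

def reads_needed(abundance, target=0.95):
    """Smallest number of reads that reaches the target detection probability."""
    return math.ceil(math.log(1.0 - target) / math.log1p(-abundance))

# Hypothetical example: an organism making up one in a million of the community.
rare = 1e-6
for n in (10**5, 10**6, 10**7):
    print(f"{n:>10} reads: P(detect) = {detection_probability(rare, n):.3f}")
print("reads for 95% detection:", reads_needed(rare))
```

Under these assumptions sensitivity depends only on the number of independent reads, not on their length, which is why a platform that yields many short reads samples rare organisms more deeply than one that yields fewer long reads.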
Challenges of short-read analysis

  Low signal for functional analysis; no linkage at all.

  High error rates.

  Massive volume.

  Rapidly changing technology.



  Several approaches but we have settled on
    assembly.
Our “Grand Challenge” dataset
Total: 1,846 Gbp soil metagenome

[Figure: bar chart of basepairs of sequencing (Gbp, 0 to 600) per sample. Legend: GAII, HiSeq. Samples: Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass.
Comparison lines: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp.]
Approach 1: Partitioning

Split reads into “bins”
     belonging to different
     source species.
Can do this based almost
   entirely on
   connectivity of
   sequences.
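As a rough illustration of connectivity-based binning, the sketch below groups reads that share at least one k-mer, using a union-find structure. It is a simplification and not the implementation from the talk: it ignores reverse complements, keeps an exact dictionary of k-mers in memory, and the function names and toy reads are hypothetical.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def partition_reads(reads, k):
    """Group reads that are connected, directly or transitively, by shared k-mers."""
    parent = list(range(len(reads)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for kmer in kmers(read, k):
            if kmer in first_owner:
                root_a, root_b = find(idx), find(first_owner[kmer])
                if root_a != root_b:
                    parent[root_a] = root_b
            else:
                first_owner[kmer] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())

# Toy reads from two unrelated "species"; each pair overlaps by one 5-mer.
reads = ["ACGTACGGTC", "CGGTCAAT", "TTTTTGGGGG", "TGGGGGCC"]
for partition in partition_reads(reads, k=5):
    print(partition)
```

Each resulting partition could then be handed to an assembler on its own.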
Partitioning for scaling

   Can be done in ~10x less memory than assembly.

   Partition at low k and assemble exactly at any higher k (DBG).

   Partitions can then be assembled independently
       Multiple processors -> scaling
       Multiple k, coverage -> improved assembly
       Multiple assembly packages (tailored to high variation, etc.)


   Can eliminate small partitions/contigs in the partitioning
    phase.

   An incredibly convenient approach enabling divide & conquer
    approaches across the board.
Technical challenges met (and defeated)

 Novel data structure properties elucidated via
   percolation theory analysis (Pell et al., PNAS, 2012)

 Exhaustive in-memory traversal of graphs
   containing 5-15 billion nodes.

 Sequencing technology introduces false
   connections in graph (Howe et al., in prep.)

 Only 20x improvement in assembly scaling.
(NOVEL)


Approach 2: Digital normalization


Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B you need to get 100x coverage of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.
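The arithmetic behind this example is simple enough to write down. A minimal sketch, assuming coverage scales linearly with relative abundance; the function name is hypothetical.

```python
def required_coverage(target_coverage, abundances):
    """Coverage each genome receives once the rarest one reaches target_coverage."""
    rarest = min(abundances.values())
    return {name: target_coverage * share / rarest
            for name, share in abundances.items()}

# The slide's example: A is 10 times more abundant than B.
print(required_coverage(10, {"A": 10, "B": 1}))
```

With the slide's numbers this gives 100x for A and 10x for B: nine tenths of the reads from A are redundant for assembly purposes.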
Digital normalization discards redundant
                 reads prior to assembly.




       This removes reads and decreases data size, eliminates errors from
                removed reads, and normalizes coverage across loci.
Digital normalization algorithm


for read in dataset:
    if median_kmer_count(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    # otherwise: discard the read


                Note, single pass; fixed memory.
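The pseudocode above can be turned into a small runnable sketch. One assumption to flag loudly: this version counts k-mers exactly in a `Counter`, so its memory grows with the data; the "fixed memory" property on the slide would need a fixed-size probabilistic counter instead. The values of `K` and `CUTOFF` and the function names are illustrative, not taken from the talk.

```python
from collections import Counter
from statistics import median

K = 20        # hypothetical k-mer size
CUTOFF = 20   # hypothetical coverage cutoff

def kmers(read, k=K):
    """All overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalization(reads, k=K, cutoff=CUTOFF):
    """Yield reads whose median k-mer count so far is below the cutoff."""
    counts = Counter()  # exact counts: NOT fixed memory, unlike the slide's claim
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate coverage from
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute to the counts
            yield read
        # otherwise the read is redundant and is discarded

# Toy usage: ten copies of one read plus one unrelated read, small k and cutoff.
reads = ["ACGGTCATTG"] * 10 + ["TTTTCCCCAA"]
kept = list(digital_normalization(reads, k=4, cutoff=3))
print(len(reads), "reads in,", len(kept), "reads kept")
```

Because discarded reads never update the counts, most erroneous k-mers from high-coverage regions are never stored at all.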
Downsample based on de Bruijn graph
structure (which can be derived online)
Shotgun data is often (1) high coverage
           and (2) biased in coverage.

                                 (MDA amplified)
Digital normalization fixes all that.

                          Normalizes coverage

                          Discards redundancy

                          Eliminates majority of
                          errors

                          Scales assembly dramatically

                          Assembly is 98% identical.
Digital normalization retains information, while
                     discarding data and errors
Other key points
   Virtually identical contig assembly; scaffolding works but is not yet
    cookie-cutter.



   Digital normalization changes the way de Bruijn graph assembly scales
    from the size of your data set to the size of the source sample.



   Always lower memory than assembly: we never collect most erroneous
    k-mers.



   Digital normalization can be done once – and then assembly parameter
    exploration can be done.
Quotable quotes.

Comment: “This looks like a great solution for people who can’t
                     afford real computers”.


                            OK, but:


“Buying ever bigger computers is a great solution for people who
                     don’t want to think hard.”


  To be less snide: both kinds of scaling are needed, of course.
Why use diginorm?

   Use the cloud to assemble any microbial genomes incl. single-
    cell, many eukaryotic genomes, most mRNAseq, and many
    metagenomes.



   Seems to provide leverage on addressing many biological or
    sample prep problems (single-cell & genome amplification
    MDA; metagenome; heterozygosity).



   And, well, the general idea of locus specific graph analysis
    solves lots of things…
Some interim concluding thoughts
      Digital normalization-like approaches provide a path to solving
       the majority of assembly scaling problems, and will enable
       assembly on current cloud computing hardware.
          This is not true for highly diverse metagenome environments…
          For soil, we estimate that we need 50 Tbp / gram soil. Sigh.



      Biologists and bioinformaticians hate:
          Throwing away data
          Caveats in bioinformatics papers (which reviewers like, note)



      Digital normalization also discards abundance information.
Evaluating sensitivity & specificity

E. coli @ 10x + soil
   -> Digital normalization + other filters
   -> Partitioning
   -> Velvet (k from 19-51)
   -> minimus2 merge
   -> 98.5% of E. coli
Example
Dethlefsen shotgun data set / Relman lab

251 million reads / 16 GB FASTQ, gzipped
~ 24 hrs, < 32 GB of RAM for full pipeline -- $24 on Amazon EC2
    (reads => final assembly + mapping)

    Assembly stats:
              58,224 contigs > 1000 bp (average 3kb)
                   summing to 190 mb genomic
               ~38 microbial genomes worth of DNA
             ~65% of reads mapped back to assembly
What do we get for soil?

Total Assembly   Total Contigs   % Reads Assembled   Predicted protein coding   rplB genes
2.5 bill         4.5 mill        19%                 5.3 mill                   391
3.5 bill         5.9 mill        22%                 6.8 mill                   466

(The rplB gene count estimates the number of species.)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes               Adina Howe
Human genome ~3 billion bp
Coverage of Assemblies




      Corn   Prairie
Nearest reference in NCBI
Most abundant contigs in Iowa corn metagenome:
Unknown; alpha/beta hydrolase (Streptomyces sp. S4); unknown;
unknown; hypothetical protein HMP (Clostridium clostridioforme)

Most abundant contigs in Iowa prairie metagenome:
hypothetical protein (Rhodanobacter sp. 2APBS1); hypothetical protein
(Oryza sativa Japonica); outer membrane adhesin like protein (Solitalea
canadensis) ; alcohol dehydrogenase zinc-binding domain protein
(Ktedonobacter racemifer); alcohol dehydrogenase GroES domain protein
(Ktedonobacter racemifer)
(Done with MEGAN)
How many soil samples do we need to
                              sequence??
              Overlap between Iowa prairie & Iowa corn is
                             significant!




                       (Cumulative)
Adina Howe
Extracting whole genomes?
So far, we have only assembled contigs, but not whole
    genomes.

Can entire genomes be
assembled from metagenomic
data?

Iverson et al. (2012), from
    the Armbrust lab, contains a
technique for scaffolding
metagenome contigs into
~whole genomes. YES.
Perspective: the coming infopocalypse

 Assembling about $20k worth of data, we can generate
   approximately 700 microbial genomes worth of data.
   (This is only going to go up in yield/$$, note.)


 Most of these assembled genomic contigs
(and genes) do not belong to studied
organisms.


 What the heck do they do??
More thoughts on assembly
   Illumina is the only game in town for sequencing complex
    microbial populations, but dealing with the data (volume, errors)
    is tricky. This problem is being solved, by us and others.

   We’re working to make it as close to push button as
    possible, with objectively argued parameters and tools, and
    methods for evaluating new tools and sequencing types.

   The community is working on dealing with data downstream of
    sequencing & assembly.
       Most pipelines were built around 454 data – long reads, and
        relatively few of them.
       With Illumina, we can get both long contigs and quantitative
        information about their abundance. This necessitates changes to
        pipelines like MG-RAST and HUMANn.
The interpretation challenge
   For soil, we have generated approximately 1200 bacterial
    genomes worth of assembled genomic DNA from two soil
    samples.

   The vast majority of this genomic DNA contains unknown genes
    with largely unknown function.

   Most annotations of gene function & interaction are from a few
    phylogenetically limited model organisms
         Est 98% of annotations are computationally inferred: transferred
          from model organisms to genomic sequence, using homology.
         Can these annotations be transferred? (Probably not.)

        This will be the biggest sequence analysis challenge of the next 50
                                         years.
Concluding thoughts on “assembly”
       We can handle all the data (modulo another year or so of
        engineering.) Bring it on!

       Our approaches let us (& you) assemble pretty much anything, much
        more easily than before. (Single cell, microbial
        genomes, transcriptomes, eukaryotic genomes, metagenomes, BAC
        sequencing…)

       Seriously. No more problemo. Done. Finished. Kaput.



       So now what?
           Validation.
           Interpretation and building general tools.
           Interpretation relies on annotation… (Uh oh.)
What are future needs?

   High-quality, medium+ throughput annotation of genomes?
        Extrapolating from model organisms is both immensely
         important and yet lacking.
        Strong phylogenetic sampling bias in existing annotations.



   Synthetic biology for investigating non-model organisms?
        (Cleverness in experimental biology doesn’t scale.)


   Integration of microbiology, community ecology/evolution
    modeling, and data analysis.
Replication fu

   In December 2011, I met Wes McKinney on a train and he
    convinced me that I should look at IPython Notebook.


   This is an interactive Web notebook for data analysis…


   Hey, neat! We can use this for replication!
       All of our figures can be regenerated from scratch, on an EC2
        instance, using a Makefile (data pipeline) and IPython
        Notebook (figure generation).
       Everything is version controlled.
       Honestly not much work, and will be less the next time.
So… how’d that go?
   People who already cared thought it was nifty.

                     http://ivory.idyll.org/blog/replication-i.html

   Almost nobody else cares ;(
         Presub enquiry to editor: “Be sure that your paper can be reproduced.” Uh, please
          read my letter to the end?
         “Could you improve your Makefile? I want to reimplement diginorm in another
          language and reuse your pipeline, but your Makefile is a mess.”

   Incredibly useful, nonetheless. Already part of undergraduate and graduate training
    in my lab; helping us and others with next papers; etc. etc. etc.



Life is way too short to waste on unnecessarily replicating your own workflows, much
                                    less other people’s.
Acknowledgements
                                       Collaborators
Lab members involved               Jim Tiedje, MSU
   Adina Howe (w/Tiedje)
   Jason Pell                     Billie Swalla, UW
   Arend Hintze
                                   Janet Jansson, LBNL
   Rosangela Canino-Koning
   Qingpeng Zhang                 Susannah Tringe, JGI
   Elijah Lowe
   Likit Preeyanon
   Jiarong Guo                 Funding
   Tim Brom                        USDA NIFA; NSF IOS;
   Kanchan Pavangadkar
   Eric McDonald                        BEACON.

Current research in my lab
Solving the rest of your problems 




                           Preliminary functional analysis
Search SSU rRNA gene in Illumina data

      1.   Shotgun sequencing randomly samples ~100 bp stretches of DNA
           from microbial genomes;

      2.   Everything is sequenced;

      3.   Not limited by primers or PCR bias;

      4.   Data mining is the challenge.
(SSU rRNA gene length / Genome length) x Reads # = Expected SSU rRNA gene fragments
(10^3 / 10^6) x 10^7 = 10^4
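The estimate above is a one-line proportion. A minimal sketch, assuming reads are drawn uniformly along a single genome and ignoring read length and copy number; the function name is hypothetical.

```python
def expected_gene_fragments(gene_len, genome_len, n_reads):
    """Expected number of reads that fall in a gene occupying gene_len of the genome."""
    return n_reads * gene_len / genome_len

# The slide's orders of magnitude: 10^3 bp gene, 10^6 bp genome, 10^7 reads.
print(expected_gene_fragments(10**3, 10**6, 10**7))
```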
Classification: Pyrotag vs shotgun

[Figure: classification comparison. Series: RDP-pyrotag-SSU, silva-pyrotag-SSU, silva-shotgun-SSU.]
[Figure: E. coli SSU rRNA gene, 1542 bp, with Forward and Reverse primers marked. Start: 907; End: 1402.]

Sequence logo of short reads at forward primer region.
Current forward primer: AAACTYAAAKGAATTGACGG

Sequence logo of short reads at reverse primer region.
Current reverse primer (reverse complement): GYACACACCGCCCGT

Primers used in 454 Titanium sequencing of SSU rRNA gene, using
E. coli as an example. Consensus sequences of the primer region from
Illumina reads suggest 1) searching method is good and 2) primer bias
is minimal at the current E-value cutoff.
CowRumen – JGI 16S primer mismatches
position  base      A      T      C      G   Total
       1     G  0.001  0.001  0.002  0.996   12154
       2     T  0.002  0.983  0.003  0.012   12169
       3     G  0.001  0.001  0.002  0.995   12166
       4     C  0.001  0.001  0.996  0.002   12143
       5     C  0.003  0.001  0.994  0.002   12183
       6     A  0.986      0  0.008  0.005   12209
       7     G  0.001  0.001  0.002  0.996   12189
       8     C  0.001  0.001  0.996  0.002   12198
       9     A  0.978  0.001  0.017  0.004   12230
      10     G  0.001      0  0.002  0.997   12231
      11     C  0.001  0.001  0.996  0.002   12198
      12     C  0.002  0.001  0.994  0.003   12185
      13     G      0      0  0.002  0.997   12190
      14     C  0.001  0.001  0.995  0.003   12195
      15     G  0.001  0.001      0  0.998   12213
      16     G  0.001  0.001      0  0.998   12206
      17     T  0.002  0.974  0.003  0.021   12171
      18     A   0.99  0.001  0.006  0.003   12150
      19     A  0.995  0.001  0.002  0.002   12106
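A table like the one above can be computed by tallying bases column by column. This is a minimal sketch, not the pipeline used for the talk: it assumes the reads have already been located, oriented, and aligned without gaps to the start of the primer site. The primer string is read off the table's base column; the toy reads and the function name are hypothetical.

```python
from collections import Counter

def position_frequencies(primer, aligned_reads):
    """Per primer position: (position, expected base, base frequencies, total reads)."""
    rows = []
    for pos, expected in enumerate(primer):
        column = Counter(
            read[pos] for read in aligned_reads
            if pos < len(read) and read[pos] in "ATCG"
        )
        total = sum(column.values())
        freqs = {base: (column[base] / total if total else 0.0) for base in "ATCG"}
        rows.append((pos + 1, expected, freqs, total))
    return rows

primer = "GTGCCAGCAGCCGCGGTAA"  # bases in the table's position column
toy_reads = [primer, primer, primer, "GTGCCAGCCGCCGCGGTAA"]  # last read: A->C at position 9

print("position base      A      T      C      G  Total")
for pos, expected, freqs, total in position_frequencies(primer, toy_reads):
    cells = "  ".join(f"{freqs[b]:.3f}" for b in "ATCG")
    print(f"{pos:>8} {expected:>4}  {cells}  {total:>5}")
```

A position where the expected base falls well below 1.0 marks a site where the primer would mismatch a noticeable fraction of the community.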
Running HMMs over de Bruijn graphs (=> cross validation)

 hmmgs: Assemble based on good-scoring HMM paths through the graph.

 Independent of other assemblers; very sensitive, specific.

 95% of hmmgs rplB domains are present in our partitioned assemblies.

Jordan Fish, Qiong Wang, and Jim Cole (RDP)
Streaming error correction.

First pass (all reads): Does read come from a high-coverage locus?
   Yes! -> Error-correct low-abundance k-mers in read.
   No!  -> Add read to graph and save for later.

Second pass (only saved reads): Does read come from a now high-coverage locus?
   Yes! -> Error-correct low-abundance k-mers in read.
   No!  -> Leave unchanged.

We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2
passes, fixed memory.
We have just submitted a proposal to adapt Euler or Quake-like
  error correction (e.g. spectral alignment problem) to this
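The two-pass scheme can be sketched as follows. Several things here are assumptions rather than the talk's implementation: the sketch trims a read at its first low-abundance k-mer instead of correcting it, it uses an exact `Counter` rather than a fixed-memory structure, it holds the saved reads in a list rather than on disk, and all names and thresholds are hypothetical.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    """All overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def is_high_coverage(read, counts, k, coverage_cutoff):
    """True if the read's median k-mer count has reached the coverage cutoff."""
    read_kmers = kmers(read, k)
    return bool(read_kmers) and median(counts[km] for km in read_kmers) >= coverage_cutoff

def trim_low_abundance(read, counts, k, min_count):
    """Truncate the read just before the first k-mer seen fewer than min_count times."""
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] < min_count:
            return read[:i + k - 1]
    return read

def streaming_trim(reads, k=20, coverage_cutoff=20, min_count=2):
    counts = Counter()
    output, saved = [], []
    # First pass: trim reads from loci that are already high coverage;
    # otherwise count the read's k-mers and set the read aside.
    for read in reads:
        if is_high_coverage(read, counts, k, coverage_cutoff):
            output.append(trim_low_abundance(read, counts, k, min_count))
        else:
            counts.update(kmers(read, k))
            saved.append(read)
    # Second pass: only the saved reads are revisited.
    for read in saved:
        if is_high_coverage(read, counts, k, coverage_cutoff):
            output.append(trim_low_abundance(read, counts, k, min_count))
        else:
            output.append(read)  # still low coverage: leave unchanged
    return output

# Toy usage: five correct copies, then one copy with an error in its last base.
reads = ["ACGGTCATTG"] * 5 + ["ACGGTCATTC"]
print(streaming_trim(reads, k=4, coverage_cutoff=3, min_count=2))
```

Only the reads seen before a locus reached the cutoff are touched twice, which is why the total work stays under two full passes.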
Side note: error correction is the
   biggest “data” problem left in
                     sequencing.




  Both for mapping & assembly.
Supplemental: abundance filtering is very lossy.

[Figure: bar chart, “Percent loss from abundance filtering (all >= 2)”. x-axis: Percentage lost (0 to 100). Bars for contigs and bp, for: Largest partition, 8.2x partition, 3.8x partition, Total.]
Comparing assemblers
Comparing assemblies / dendrogram
Integrating modeling into data analysis?

Contenu connexe

En vedette

Motoholics Sponsorship Proposal 2010
Motoholics Sponsorship Proposal 2010Motoholics Sponsorship Proposal 2010
Motoholics Sponsorship Proposal 2010Gaurab Dutta
 
Avysta Presentation
Avysta PresentationAvysta Presentation
Avysta Presentationguest95d5ba
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithmsc.titus.brown
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011c.titus.brown
 
Marketing Your Message Literacy Program Sustainability
Marketing Your Message   Literacy Program SustainabilityMarketing Your Message   Literacy Program Sustainability
Marketing Your Message Literacy Program SustainabilitySarah Halstead
 
Interactive NETS*S Workshop, ISTE 2011
Interactive NETS*S Workshop, ISTE 2011Interactive NETS*S Workshop, ISTE 2011
Interactive NETS*S Workshop, ISTE 2011arowland1313
 
父母恩重難報經
父母恩重難報經父母恩重難報經
父母恩重難報經tina59520
 
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...Circles of San Antonio Community Coalition
 
Castello Di Paternò
Castello  Di PaternòCastello  Di Paternò
Castello Di PaternòYvonne Sgroi
 
Cloudxp keynote 19 sept pvu
Cloudxp keynote 19 sept pvuCloudxp keynote 19 sept pvu
Cloudxp keynote 19 sept pvuPiet van Vugt
 
Keynote nobel doc experience
Keynote nobel doc experienceKeynote nobel doc experience
Keynote nobel doc experiencePiet van Vugt
 
Recount of trip to Howick Historical Village
Recount of trip to Howick Historical VillageRecount of trip to Howick Historical Village
Recount of trip to Howick Historical VillageTakahe One
 
News and Views of the Portage County Literacy Council
News and Views of the Portage County Literacy CouncilNews and Views of the Portage County Literacy Council
News and Views of the Portage County Literacy CouncilSarah Halstead
 

En vedette (20)

Motoholics Sponsorship Proposal 2010
Motoholics Sponsorship Proposal 2010Motoholics Sponsorship Proposal 2010
Motoholics Sponsorship Proposal 2010
 
Cope Manifesto
Cope ManifestoCope Manifesto
Cope Manifesto
 
Avysta Presentation
Avysta PresentationAvysta Presentation
Avysta Presentation
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
 
Marketing Your Message Literacy Program Sustainability
Marketing Your Message   Literacy Program SustainabilityMarketing Your Message   Literacy Program Sustainability
Marketing Your Message Literacy Program Sustainability
 
Ramifications of New USEPA
Ramifications of New USEPARamifications of New USEPA
Ramifications of New USEPA
 
Interactive NETS*S Workshop, ISTE 2011
Interactive NETS*S Workshop, ISTE 2011Interactive NETS*S Workshop, ISTE 2011
Interactive NETS*S Workshop, ISTE 2011
 
3 Hr Workbook
3 Hr Workbook3 Hr Workbook
3 Hr Workbook
 
父母恩重難報經
父母恩重難報經父母恩重難報經
父母恩重難報經
 
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
 
Castello Di Paternò
Castello  Di PaternòCastello  Di Paternò
Castello Di Paternò
 
Coke
CokeCoke
Coke
 
Cloudxp keynote 19 sept pvu
Cloudxp keynote 19 sept pvuCloudxp keynote 19 sept pvu
Cloudxp keynote 19 sept pvu
 
Promotional Gaming
Promotional GamingPromotional Gaming
Promotional Gaming
 
TST Social Host Webinar- Michael Sparks June 13, 2014
TST Social Host Webinar- Michael Sparks June 13, 2014TST Social Host Webinar- Michael Sparks June 13, 2014
TST Social Host Webinar- Michael Sparks June 13, 2014
 
18 Di Concetta
18 Di Concetta18 Di Concetta
18 Di Concetta
 
Keynote nobel doc experience
Keynote nobel doc experienceKeynote nobel doc experience
Keynote nobel doc experience
 
Recount of trip to Howick Historical Village
Recount of trip to Howick Historical VillageRecount of trip to Howick Historical Village
Recount of trip to Howick Historical Village
 
News and Views of the Portage County Literacy Council
News and Views of the Portage County Literacy CouncilNews and Views of the Portage County Literacy Council
News and Views of the Portage County Literacy Council
 

Similaire à 2012 stamps-mbl-2

2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattle2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattlec.titus.brown
 
Talk at 2012 Notre Dame Collab Computing Lab workshop
Talk at 2012 Notre Dame Collab Computing Lab workshopTalk at 2012 Notre Dame Collab Computing Lab workshop
Talk at 2012 Notre Dame Collab Computing Lab workshopc.titus.brown
 
Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)
Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)
Trait data mining at European pre-breeding workshop at Alnarp (25 Nov 2009)Dag Endresen
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
Database talk for Bits & Bites meeting





2012 stamps-mbl-2

  • 1.  Metagenome assembly – part II C. Titus Brown ctb@msu.edu
  • 2. Warnings This talk contains forward looking statements. These forward- looking statements can be identified by terminology such as “will”, “expects”, and “believes”. -- Safe Harbor provisions of the U.S. Private Securities Litigation Act “Making predictions is difficult, especially if they’re about the future.” -- Attributed to Niels Bohr
  • 3. The computational conundrum More data => better. and More data => computationally more challenging.
  • 4. Reads vs edges (memory) in de Bruijn graphs Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
  • 5. 2. Big data sets require big machines For even relatively small data sets, metagenomic assemblers scale poorly. Memory usage ~ “real” variation + number of errors Number of errors ~ size of data set Size of data set == big!! (Estimated 6 weeks x 3 TB of RAM to do 300gb soil sample, with a slightly modified conventional assembler.)
  • 6. Soil is full of uncultured microbes Randy Jackson
  • 8. Great Prairie sampling design. [Figure: reference core with spatial samples at 1 cm, 1 m, and 10 m spacing] Soil cores: 1 inch diameter, 4 inches deep (litter and roots removed) • Spatial samples: 16S rRNA, nifH • Reference sample sequenced (small unmixed sample) Reference bulk soil: stored for additional “omics” and metadata
  • 9. Soil contains thousands to millions of species (“Collector’s curves” of ~species) [Figure: number of OTUs (0 to 2000) vs. number of sequences (100 to 8100) for Iowa Corn, Iowa Native Prairie, Kansas Corn, Kansas Native Prairie, Wisconsin Corn, Wisconsin Native Prairie, Wisconsin Restored Prairie, and Wisconsin Switchgrass]
  • 10. The set of questions for soil -- discovery • What’s there? • Is it really that complex a community? • How “deep” do we need to sequence to sample thoroughly and systematically? • What organisms and gene functions are present, including non-canonical carbon and nitrogen cycling pathways? • What kind of organismal and functional overlap is there between different sites? (Total sampling needed?) • How is ecological complexity created & maintained? • How does ecological complexity respond to perturbation?
  • 11. Why are we applying short-read sequencing to this problem!? • Short-read sampling is deep and quantitative. • Statistical argument: your ability to observe rare organisms – your sensitivity of measurement – is directly related to the number of independent sequences you take. • Longer reads (PacBio, 454, Ion Torrent) are less informative. • Majority of metagenome studies going forward will make use of Illumina. • BUT this kind of sequence is challenging to analyze. • BUT, BUT this kind of sequence is necessary for high complexity environments.
  • 12. Challenges of short-read analysis • Low signal for functional analysis; no linkage at all. • High error rates. • Massive volume. • Rapidly changing technology. • Several approaches but we have settled on assembly.
  • 13. Our “Grand Challenge” dataset. Total: 1,846 Gbp soil metagenome. [Figure: basepairs of sequencing (Gbp), GAII and HiSeq, for Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass. Reference levels: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp]
  • 14. Approach 1: Partitioning Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences.
  • 15. Partitioning for scaling • Can be done in ~10x less memory than assembly. • Partition at low k and assemble exactly at any higher k (DBG). • Partitions can then be assembled independently • Multiple processors -> scaling • Multiple k, coverage -> improved assembly • Multiple assembly packages (tailored to high variation, etc.) • Can eliminate small partitions/contigs in the partitioning phase. • An incredibly convenient approach enabling divide & conquer approaches across the board.
  • 16. Technical challenges met (and defeated) • Novel data structure properties elucidated via percolation theory analysis (Pell et al., PNAS, 2012) • Exhaustive in-memory traversal of graphs containing 5-15 billion nodes. • Sequencing technology introduces false connections in graph (Howe et al., in prep.) • Only 20x improvement in assembly scaling.
  • 17. (NOVEL) Approach 2: Digital normalization Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
  • 18. Digital normalization discards redundant reads prior to assembly. This removes reads and decreases data size, eliminates errors from removed reads, and normalizes coverage across loci.
  • 19. Digital normalization algorithm
        for read in dataset:
            if median_kmer_count(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                pass  # discard read
        Note, single pass; fixed memory.
  • 20. Downsample based on de Bruijn graph structure (which can be derived online)
  • 21. Shotgun data is often (1) high coverage and (2) biased in coverage. (MD amplified)
  • 22. Digital normalization fixes all that. Normalizes coverage; discards redundancy; eliminates majority of errors; scales assembly dramatically. Assembly is 98% identical.
  • 23. Digital normalization retains information, while discarding data and errors
  • 24. Other key points • Virtually identical contig assembly; scaffolding works but is not yet cookie-cutter. • Digital normalization changes the way de Bruijn graph assembly scales from the size of your data set to the size of the source sample. • Always lower memory than assembly: we never collect most erroneous k-mers. • Digital normalization can be done once – and then assembly parameter exploration can be done.
  • 25. Quotable quotes. Comment: “This looks like a great solution for people who can’t afford real computers”. OK, but: “Buying ever bigger computers is a great solution for people who don’t want to think hard.” To be less snide: both kinds of scaling are needed, of course.
  • 26. Why use diginorm? • Use the cloud to assemble any microbial genomes incl. single-cell, many eukaryotic genomes, most mRNAseq, and many metagenomes. • Seems to provide leverage on addressing many biological or sample prep problems (single-cell & genome amplification MDA; metagenome; heterozygosity). • And, well, the general idea of locus specific graph analysis solves lots of things…
  • 27. Some interim concluding thoughts • Digital normalization-like approaches provide a path to solving the majority of assembly scaling problems, and will enable assembly on current cloud computing hardware. • This is not true for highly diverse metagenome environments… • For soil, we estimate that we need 50 Tbp / gram soil. Sigh. • Biologists and bioinformaticians hate: • Throwing away data • Caveats in bioinformatics papers (which reviewers like, note) • Digital normalization also discards abundance information.
  • 28. Evaluating sensitivity & specificity. [Pipeline diagram: E. coli @ 10x + soil -> digital normalization + other filters -> partitioning -> Velvet (k from 19-51) -> minimus2 merge -> 98.5% of E. coli]
  • 29. Example: Dethlefsen shotgun data set / Relman lab. 251 m reads / 16gb FASTQ gzipped. ~ 24 hrs, < 32 gb of RAM for full pipeline -- $24 on Amazon EC2 (reads => final assembly + mapping). Assembly stats: 58,224 contigs > 1000 bp (average 3kb) summing to 190 mb genomic; ~38 microbial genomes worth of DNA; ~65% of reads mapped back to assembly.
  • 30. What do we get for soil?
        Total Assembly   Total Contigs   % Reads Assembled   Predicted protein coding   rplB genes
        2.5 bill         4.5 mill        19%                 5.3 mill                   391
        3.5 bill         5.9 mill        22%                 6.8 mill                   466
        (The rplB gene count estimates the number of species.)
        Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. (Adina Howe)
  • 31. Coverage of Assemblies [Figure panels: Corn; Prairie]
  • 32. Nearest reference in NCBI. Most abundant contigs in Iowa corn metagenome: unknown; alpha/beta hydrolase (Streptomyces sp. S4); unknown; unknown; hypothetical protein HMP (Clostridium clostridioforme). Most abundant contigs in Iowa prairie metagenome: hypothetical protein (Rhodanobacter sp. 2APBS1); hypothetical protein (Oryza sativa Japonica); outer membrane adhesin like protein (Solitalea canadensis); alcohol dehydrogenase zinc-binding domain protein (Ktedonobacter racemifer); alcohol dehydrogenase GroES domain protein (Ktedonobacter racemifer).
  • 34. How many soil samples do we need to sequence?? Overlap between Iowa prairie & Iowa corn is significant! (Cumulative) Adina Howe
  • 35. Extracting whole genomes? So far, we have only assembled contigs, but not whole genomes. Can entire genomes be assembled from metagenomic data? Iverson et al. (2012), from the Armbrust lab, contains a technique for scaffolding metagenome contigs into ~whole genomes. YES.
  • 36. Perspective: the coming infopocalypse • Assembling about $20k worth of data, we can generate approximately 700 microbial genomes worth of data. (This is only going to go up in yield/$$, note.) • Most of these assembled genomic contigs (and genes) do not belong to studied organisms. • What the heck do they do??
  • 37. More thoughts on assembly • Illumina is the only game in town for sequencing complex microbial populations, but dealing with the data (volume, errors) is tricky. This problem is being solved, by us and others. • We’re working to make it as close to push button as possible, with objectively argued parameters and tools, and methods for evaluating new tools and sequencing types. • The community is working on dealing with data downstream of sequencing & assembly. • Most pipelines were built around 454 data – long reads, and relatively few of them. • With Illumina, we can get both long contigs and quantitative information about their abundance. This necessitates changes to pipelines like MG-RAST and HUMAnN.
  • 38. The interpretation challenge • For soil, we have generated approximately 1200 bacterial genomes worth of assembled genomic DNA from two soil samples. • The vast majority of this genomic DNA contains unknown genes with largely unknown function. • Most annotations of gene function & interaction are from a few phylogenetically limited model organisms. • Est 98% of annotations are computationally inferred: transferred from model organisms to genomic sequence, using homology. • Can these annotations be transferred? (Probably not.) This will be the biggest sequence analysis challenge of the next 50 years.
  • 39. Concluding thoughts on “assembly” • We can handle all the data (modulo another year or so of engineering.) Bring it on! • Our approaches let us (& you) assemble pretty much anything, much more easily than before. (Single cell, microbial genomes, transcriptomes, eukaryotic genomes, metagenomes, BAC sequencing…) • Seriously. No more problemo. Done. Finished. Kaput. • So now what? • Validation. • Interpretation and building general tools. • Interpretation relies on annotation… (Uh oh.)
  • 40. What are future needs? • High-quality, medium+ throughput annotation of genomes? • Extrapolating from model organisms is both immensely important and yet lacking. • Strong phylogenetic sampling bias in existing annotations. • Synthetic biology for investigating non-model organisms? (Cleverness in experimental biology doesn’t scale) • Integration of microbiology, community ecology/evolution modeling, and data analysis.
  • 41. Replication fu • In December 2011, I met Wes McKinney on a train and he convinced me that I should look at IPython Notebook. • This is an interactive Web notebook for data analysis… • Hey, neat! We can use this for replication! • All of our figures can be regenerated from scratch, on an EC2 instance, using a Makefile (data pipeline) and IPython Notebook (figure generation). • Everything is version controlled. • Honestly not much work, and will be less the next time.
  • 42.
  • 43. So… how’d that go? • People who already cared thought it was nifty. http://ivory.idyll.org/blog/replication-i.html • Almost nobody else cares ;( • Presub enquiry to editor: “Be sure that your paper can be reproduced.” Uh, please read my letter to the end? • “Could you improve your Makefile? I want to reimplement diginorm in another language and reuse your pipeline, but your Makefile is a mess.” • Incredibly useful, nonetheless. Already part of undergraduate and graduate training in my lab; helping us and others with next papers; etc. etc. etc. Life is way too short to waste on unnecessarily replicating your own workflows, much less other people’s.
  • 44. Acknowledgements. Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI. Lab members involved: Adina Howe (w/Tiedje); Jason Pell; Arend Hintze; Rosangela Canino-Koning; Qingpeng Zhang; Elijah Lowe; Likit Preeyanon; Jiarong Guo; Tim Brom; Kanchan Pavangadkar; Eric McDonald. Funding: USDA NIFA; NSF IOS; BEACON.
  • 45.
  • 46. Solving the rest of your problems • Current research in my lab • Preliminary functional analysis
  • 47. Search SSU rRNA gene in Illumina data 1. Randomly sequencing about 100bp long DNA in microbial genomes; 2. Everything is sequenced; 3. Not limited by primers or PCR bias; 4. Data mining is the challenge. [Figure: expected number of SSU rRNA gene fragments, from SSU rRNA gene length, genome length, and number of reads; values shown: 10^3, 10^7, 10^4, 10^6]
  • 48. Classification: Pyrotag vs shotgun [Figure panels: RDP-pyrotag-SSU; silva-pyrotag-SSU; silva-shotgun-SSU]
  • 49. [Figure: 1542 bp gene; forward primer, start: 907; reverse primer, end: 1402. Sequence logo of short reads at forward primer region, with current forward primer AAACTYAAAKGAATTGACGG; sequence logo of short reads at reverse primer region, with current reverse primer (reverse complement) GYACACACCGCCCGT.] Primers used in 454 Titanium sequencing of SSU rRNA gene, using E. coli as an example. Consensus sequences of the primer region from Illumina reads suggest 1) searching method is good and 2) primer bias is minimal at the current E-value cutoff.
  • 50. CowRumen – JGI 16S primer mismatches
        position   A       T       C       G       Total
        1G         0.001   0.001   0.002   0.996   12154
        2T         0.002   0.983   0.003   0.012   12169
        3G         0.001   0.001   0.002   0.995   12166
        4C         0.001   0.001   0.996   0.002   12143
        5C         0.003   0.001   0.994   0.002   12183
        6A         0.986   0       0.008   0.005   12209
        7G         0.001   0.001   0.002   0.996   12189
        8C         0.001   0.001   0.996   0.002   12198
        9A         0.978   0.001   0.017   0.004   12230
        10G        0.001   0       0.002   0.997   12231
        11C        0.001   0.001   0.996   0.002   12198
        12C        0.002   0.001   0.994   0.003   12185
        13G        0       0       0.002   0.997   12190
        14C        0.001   0.001   0.995   0.003   12195
        15G        0.001   0.001   0       0.998   12213
        16G        0.001   0.001   0       0.998   12206
        17T        0.002   0.974   0.003   0.021   12171
        18A        0.99    0.001   0.006   0.003   12150
        19A        0.995   0.001   0.002   0.002   12106
  • 51. Running HMMs over de Bruijn graphs (=> cross validation) • hmmgs: Assemble based on good-scoring HMM paths through the graph. • Independent of other assemblers; very sensitive, specific. • 95% of hmmgs rplB domains are present in our partitioned assemblies. Jordan Fish, Qiong Wang, and Jim Cole (RDP)
  • 52. Streaming error correction. [Flowchart: First pass (all reads): does read come from a high-coverage locus? Yes: error-correct low-abundance k-mers in read. No: add read to graph and save for later. Second pass (only saved reads): does read come from a now high-coverage locus? Yes: error-correct low-abundance k-mers in read. No: leave unchanged.] We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2 passes, fixed memory. We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment problem) to this.
  • 53.
  • 54. Side note: error correction is the biggest “data” problem left in sequencing. Both for mapping & assembly.
  • 55. [Figure: 1542 bp gene; forward primer, start: 907; end: 1402. Consensus of short reads at forward primer region and at reverse primer region; current forward primer: AAACTYAAAKGAATTGACGG.] Figure. Primers used in 454 Titanium sequencing of 16S rRNA gene, using E. coli as an example. Consensus sequences of the primer region from Illumina reads suggest primer bias is minimal at the current E-value cutoff.
  • 56. Supplemental: abundance filtering is very lossy. [Figure: percent loss from abundance filtering (all >= 2), in contigs and bp, for the largest partition, the 8.2x partition, the 3.8x partition, and the total; x-axis: percentage lost, 0 to 100]
  • 59. Integrating modeling into data analysis?
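The digital normalization loop on slide 19 can be sketched in runnable form as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the k-mer size `K`, the value of `CUTOFF`, and the helper names `kmers` and `normalize_by_median` are all hypothetical, and an exact in-memory `Counter` stands in for whatever counting structure the real pipeline uses. As a result this version is single pass but not fixed memory.

```python
from collections import Counter
from statistics import median

K = 20        # k-mer size (assumed value, not from the slides)
CUTOFF = 20   # median k-mer count above which a read is discarded (assumed)


def kmers(seq, k=K):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=K, cutoff=CUTOFF):
    """Keep a read only if the median count of its k-mers is below cutoff.

    Counts are updated only from reads that are kept, so k-mers that
    occur only in discarded reads are never stored.
    """
    counts = Counter()  # exact counts; a fixed-size counter would bound memory
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k: nothing to count
        if median(counts[km] for km in ks) < cutoff:
            counts.update(ks)
            kept.append(read)
        # else: discard read
    return kept


if __name__ == "__main__":
    reads = ["ACGTTGCAAT"] * 10 + ["GGGGCCCCTTTT"]
    for read in normalize_by_median(reads, k=4, cutoff=3):
        print(read)
```

Because only kept reads update the counts, redundant copies of a high-coverage sequence stop being stored once the cutoff is reached, while a low-coverage sequence is always retained.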

Editor's notes

  1. Completely different style of assembler; useful for cross validation.
  2. Multi-k stuff.
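Slides 14 and 15 describe partitioning reads into bins by sequence connectivity. One way to illustrate the idea, under simplifying assumptions, is a union-find over reads that share at least one k-mer. The function name `partition_reads`, the default `k`, and the choice of union-find are assumptions made for this sketch. It ignores reverse complements and sequencing errors, and it is not the graph traversal the talk refers to.

```python
def partition_reads(reads, k=20):
    """Group reads into bins; two reads land in the same bin if they are
    linked by a chain of shared k-mers. k=20 is an assumed default."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the representative of i, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            km = read[j:j + k]
            if km in first_seen:
                a, b = find(first_seen[km]), find(idx)
                if a != b:
                    parent[b] = a  # merge the two partitions
            else:
                first_seen[km] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


if __name__ == "__main__":
    reads = ["AACCGGTT", "CGGTTAAG", "GAGAGAGA", "TAAGCATC"]
    for part in partition_reads(reads, k=4):
        print(part)
```

Each resulting bin could then be handed to an assembler independently, which is the divide-and-conquer property slide 15 points to.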