Adina Howe
Michigan State University, Adjunct
Argonne National Laboratory, Postdoc
ASM Workshop, May 2013
Visual Complexity
http://www.flickr.com/photos/maisonbisson
MSU Lab:
• Titus Brown
• Jim Tiedje
• Jason Pell
• Qingpeng Zhang
• Jordan Fish
• Eric McDonald
• Chris Welcher
• Aaron Garoutte
• Jiarong Guo
Collaborators:
• Janet Jansson
• Susannah Tringe
• I will upload these slides to SlideShare (adinachuanghowe)
• khmer documentation
github.com/ged-lab/khmer/
https://khmer.readthedocs.org/en/latest/guide.html
• Manuscripts
Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
http://www.pnas.org/content/early/2012/07/25/1121464109
A reference-free algorithm for computational normalization of shotgun sequencing data
http://arxiv.org/abs/1203.4802
Assembling large, complex metagenomes
http://arxiv.org/abs/1212.2832
[Figure: high-abundance and low-abundance organisms, in the environment (our goal) versus in our hands]
A few gotchas of sequencing:
Errors / Artifacts (confusion)
Diversity / Complexity (scale)
1. Digital normalization (lossy compression)
2. Partitioning
3. Enabling use of current, previously unusable assembly tools
• Reduces data for analysis
• Longer sequences (increased accuracy of annotation)
• Gene order
• Does not rely on known references, access to unknowns
• Creates new references
• Lots of assembly tools available
But…
• High memory requirements
• Depends on good (~10x) sequencing coverage
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10: just draw a line straight down from the top through all of the reads.
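The definition of coverage above can be sketched in a few lines of Python. This is only an illustration: the function names, read positions, and genome size are made up for the example and do not come from the slides or from khmer.

```python
def mean_coverage(num_reads, read_length, genome_length):
    """Average number of reads overlapping each base: (reads * length) / genome size."""
    return num_reads * read_length / genome_length


def per_base_depth(read_starts, read_length, genome_length):
    """Count how many reads overlap each position (the 'line straight down')."""
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return depth


# Hypothetical numbers: 500 reads of 100 bp over a 5,000 bp genome.
print(mean_coverage(500, 100, 5000))      # 10.0

# Three toy reads of length 4 starting at positions 0, 2, and 4 of an 8 bp genome.
print(per_base_depth([0, 2, 4], 4, 8))    # [1, 1, 2, 2, 2, 2, 1, 1]
```

Real data has uneven depth, so the per-base view is usually more informative than the single average.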
Note that k-mer abundance is not properly represented here! Each
blue k-mer will be present around 10 times.
Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
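A toy example may make the two statements above concrete. The sequences, the value of k, and the 10-to-1 ratio of correct to erroneous reads are all invented for illustration. A substitution well inside a read changes every k-mer that overlaps it, which is where the "~k" comes from; errors near a read end affect fewer k-mers.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


K = 4
true_read = "ACGTTGCATGGA"    # made-up sequence
error_read = "ACGTTACATGGA"   # same read with one substitution (G -> A) at index 5

# Simulate ~10x coverage: ten correct copies plus one copy carrying the error.
reads = [true_read] * 10 + [error_read]
counts = Counter(km for read in reads for km in kmers(read, K))

true_kmers = set(kmers(true_read, K))
novel_kmers = set(kmers(error_read, K)) - true_kmers

print(len(novel_kmers))                           # 4, i.e. k new k-mers
print(sorted(counts[km] for km in novel_kmers))   # [1, 1, 1, 1]
print(min(counts[km] for km in true_kmers))       # 10
```

The erroneous k-mers sit at abundance 1 while the true k-mers sit near the coverage, which is the separation that abundance-based filtering relies on.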
[Figure: k-mer abundance distribution with a low-abundance peak (errors) and a high-abundance peak (true k-mers)]
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
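The arithmetic behind the A and B example can be written out as a small helper. This is a sketch under a simplifying assumption that is not stated on the slide: both organisms have genomes of the same size, so coverage scales directly with relative abundance. The function name is hypothetical.

```python
def coverage_when_rarest_reaches(relative_abundance, target_coverage):
    """Coverage of every organism once the rarest one reaches target_coverage.

    Assumes equal genome sizes, so coverage is proportional to abundance.
    """
    rarest = min(relative_abundance.values())
    return {
        name: target_coverage * abundance / rarest
        for name, abundance in relative_abundance.items()
    }


# A is ten times more abundant than B; we want 10x coverage of B.
print(coverage_when_rarest_reaches({"A": 10, "B": 1}, 10))
# {'A': 100.0, 'B': 10.0}
```

With a 1000-to-1 abundance ratio the same calculation asks for 10,000x of the abundant organism, which shows how quickly the excess data grows.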
Diginorm, a digital analog to cDNA library normalization:
• Is reference free;
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
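The properties listed above can be illustrated with a minimal sketch of the idea described in the digital normalization manuscript listed earlier: estimate each read's coverage as the median abundance of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. This is not khmer's code or API. The function name, the parameters `k` and `cutoff`, and the exact `Counter` are assumptions made for readability; a memory-efficient probabilistic counter would take the place of `Counter` at scale.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below cutoff (single pass)."""
    counts = Counter()  # exact k-mer counts; a stand-in for a probabilistic counter
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k, nothing to estimate
        # Median k-mer abundance so far is the coverage estimate for this read.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute k-mers
            yield read


# Toy data: one highly abundant read and one rare read (made-up sequences).
abundant, rare = "ACGTTGCATGGA", "TTAGGCCAATCG"
kept = list(digital_normalization([abundant] * 50 + [rare], k=4, cutoff=5))
print(kept.count(abundant), kept.count(rare))   # 5 1
```

Because k-mers are recorded only for kept reads, redundant high-coverage reads (and most of the errors they carry) never enter the counter, while the rare read is retained.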
• Digital normalization produces “good” metagenome assemblies.
• Smooths out abundance variation, strain variation.
• Reduces computational requirements for assembly.
• It also kinda makes sense :)
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
“Divide and conquer”
Memory-efficient implementation helps to scale assembly.
Pell et al., 2012, PNAS
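Partitioning by connectivity can be sketched as follows. This is a simplified stand-in, not the method from Pell et al.: it links two reads whenever they share an exact k-mer, stores k-mers in an ordinary dictionary rather than a memory-efficient probabilistic graph, and ignores reverse complements. The function name and toy reads are hypothetical.

```python
def partition_reads(reads, k):
    """Group reads into bins of reads connected through shared k-mers."""
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_read_with = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if kmer in first_read_with:
                union(first_read_with[kmer], idx)
            else:
                first_read_with[kmer] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Two made-up "species": reads 0 and 1 overlap, reads 2 and 3 overlap.
toy_reads = ["ACGTTGCA", "TTGCATGG", "GGGCCCAA", "CCCAATTT"]
for partition in partition_reads(toy_reads, k=4):
    print(partition)
```

Each resulting bin can then be assembled on its own, which is the "divide and conquer" step.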
Low coverage is the dominant problem blocking assembly of your soil metagenome.
• In order to build assemblies, each assembler makes choices (uses heuristics) to reach a conclusion.
• These heuristics may not be appropriate for your sample!
  • High polymorphism?
  • Mixed population vs. clonal?
  • Genomic vs. metagenomic vs. mRNA?
• Low coverage drives differences in assembly.
• We can assemble virtually anything but soil ;).
  • Genomes, transcriptomes, MDA, mixtures, etc.
• Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth).
• Strain variation confuses assembly, but does not prevent useful results.
• Diginorm is a systematic strategy to enable assembly.
• Banfield has shown how to deconvolve strains at differential abundance.
• Kostas K.'s results suggest that there will be a species gap sufficient to prevent contig misassembly.
• Most metagenomes require 50-150 GB of RAM.
• Many people don’t have access to computers of that size.
• Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.
• http://ged.msu.edu/angus/2013-hmp-assembly-webinar/index.html
• Optimizing our programs => faster.
• Building an evaluation framework for metagenome assemblers.
• Error correction!
• Achieving one or more assemblies is fairly straightforward.
• An assembly is a hypothesis; evaluating assemblies is challenging, however, and is where you should be thinking hardest.
• There are relatively few pipelines available for analyzing assembled metagenomic data.
• Questions?
• How do we study complexity? Interactions? Diversity? Communities? Evolution? Our environment?
• Major efforts of data collection
• Open mind for discoveries
• Willingness to adjust to change
• Multiple efforts
• Well-designed experiments
Workshop example: Illumina deep sequencing and scaling large datasets on soil metagenomes
• We receive Gb of sequences
• Generally, my data is…
  • Split by barcodes
  • Untrimmed
  • Adapters are present
  • Two paired-end FASTQ files
• Underestimation of computational requirements:
  • Quality control steps usually require 2-3 times the amount of hard drive space
  • Similarity comparison against known databases is impractical (soil metagenome: ~50 years to BLAST)
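To give a feel for what a quality control step on untrimmed FASTQ data involves, here is a minimal sketch of reading records and trimming low-quality bases from the 3' end. It is an illustration only, not the workshop's pipeline: the function names, the quality threshold of 20, and the Phred+33 encoding are assumptions, adapter removal is not shown, and dedicated trimming tools are normally used instead.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # '+' separator line, ignored
        qual = next(it).strip()
        yield header.strip(), seq, qual


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score (assumed Phred+33) is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# One made-up record: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
record_lines = ["@read1\n", "ACGTAC\n", "+\n", "IIII##\n"]
for header, seq, qual in read_fastq(record_lines):
    print(header, *trim_3prime(seq, qual))   # @read1 ACGT IIII
```

Each such step typically writes a new copy of the data, which is one reason quality control multiplies the disk space needed.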
[Image: Home Alone scream]
My first slide graphic that I’m scared may date me.
Two ways to reduce the onslaught:
• Cluster into known observances (annotate, bin)
• Assembly
• Some mix of the above
Ten of you upload 1 HiSeq flowcell into MG-RAST
• Illumina short reads from soil metagenome (~100 bp)
• 454 short reads from soil metagenome (~368 bp)
• Assembled contigs (Illumina) from soil metagenome (~491 bp)
Read length will increase… computational requirements? Assembly is a great way to reduce data.


ASM 2013 Metagenomic Assembly Workshop Slides
