2. Warnings
This talk contains forward-looking statements. These forward-looking statements can be identified by terminology such as “will”, “expects”, and “believes”.
-- Safe Harbor provisions of the U.S. Private Securities Litigation Reform Act
“Making predictions is difficult, especially if
they’re about the future.”
-- Attributed to Niels Bohr
5. 2. Big data sets require big machines
For even relatively small data sets, metagenomic assemblers scale
poorly.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
Size of data set == big!!
(Estimated 6 weeks x 3 TB of RAM to assemble a 300 Gbp soil sample with a slightly modified conventional assembler.)
6. Soil is full of uncultured microbes
Randy Jackson
8. Great Prairie sampling design
(Figure: sampling design diagram showing a reference core plus spatial samples at 1 cm, 1 m, and 10 m spacings.)
Soil cores: 1 inch diameter, 4 inches deep (litter and roots removed)
• Spatial samples: 16S rRNA, nifH
• Reference sample sequenced (small unmixed sample)
• Reference bulk soil: stored for additional “omics” and metadata
9. Soil contains thousands to millions of species
(Figure: “collector’s curves” of ~species. Number of OTUs, 0-2000, plotted against number of sequences, 100-8100, for Iowa Corn, Iowa Native Prairie, Kansas Corn, Kansas Native Prairie, Wisconsin Corn, Wisconsin Native Prairie, Wisconsin Restored Prairie, and Wisconsin Switchgrass.)
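For context, a minimal sketch of how such a collector's (rarefaction) curve can be computed, assuming one OTU label per 16S sequence; the function and variable names are illustrative, not from any particular package.

import random

def collectors_curve(otu_assignments, step=500, trials=10):
    """Mean number of distinct OTUs observed vs. number of sequences sampled."""
    points = []
    for n in range(step, len(otu_assignments) + 1, step):
        observed = [len(set(random.sample(otu_assignments, n)))
                    for _ in range(trials)]
        points.append((n, sum(observed) / trials))
    return points

# Hypothetical usage: one OTU label per sequence from a sample.
# curve = collectors_curve(["OTU_17", "OTU_3", "OTU_17", ...])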
10. The set of questions for soil -- discovery
What’s there?
Is it really that complex a community?
How “deep” do we need to sequence to sample thoroughly and
systematically?
What organisms and gene functions are present, including non-canonical carbon and nitrogen cycling pathways?
What kind of organismal and functional overlap is there between different sites? (Total sampling needed?)
How is ecological complexity created & maintained?
How does ecological complexity respond to perturbation?
11. Why are we applying short-read sequencing to this problem!?
Short-read sampling is deep and quantitative.
Statistical argument: your ability to observe rare organisms – your sensitivity of measurement – is directly related to the number of independent sequences you take.
Longer reads (PacBio, 454, Ion Torrent) are less informative.
The majority of metagenome studies going forward will make use of Illumina.
BUT this kind of sequence is challenging to analyze.
BUT, BUT this kind of sequence is necessary for high-complexity environments.
12. Challenges of short-read analysis
Low signal for functional analysis; no linkage at all.
High error rates.
Massive volume.
Rapidly changing technology.
There are several approaches, but we have settled on assembly.
14. Approach 1: Partitioning
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
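A minimal sketch of the connectivity idea, assuming reads that share any k-mer belong in the same partition. This toy union-find version keeps everything in ordinary Python dicts; the real implementation uses a compact probabilistic graph representation to reach the memory savings described on the next slide.

K = 31  # assumed k-mer size for connectivity

def kmers(seq, k=K):
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

def partition(reads):
    """Group read indices into connected components that share k-mers."""
    parent = list(range(len(reads)))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_seen = {}                   # k-mer -> index of first read containing it
    for idx, read in enumerate(reads):
        for km in kmers(read):
            if km in first_seen:
                union(idx, first_seen[km])
            else:
                first_seen[km] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())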
15. Partitioning for scaling
Can be done in ~10x less memory than assembly.
Partition at low k and assemble exactly at any higher k (DBG).
Partitions can then be assembled independently
Multiple processors -> scaling
Multiple k, coverage -> improved assembly
Multiple assembly packages (tailored to high variation, etc.)
Can eliminate small partitions/contigs in the partitioning phase.
An incredibly convenient strategy, enabling divide & conquer approaches across the board.
16. Technical challenges met (and defeated)
Novel data structure properties elucidated via percolation theory analysis (Pell et al., PNAS, 2012).
Exhaustive in-memory traversal of graphs containing 5-15 billion nodes.
Sequencing technology introduces false connections in the graph (Howe et al., in prep.)
Only 20x improvement in assembly scaling.
17. (NOVEL)
Approach 2: Digital normalization
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
18. Digital normalization discards redundant reads prior to assembly.
This removes reads and decreases data size, eliminates errors from removed reads, and normalizes coverage across loci.
19. Digital normalization algorithm
for read in dataset:
    if median_kmer_count(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read
Note: single pass; fixed memory.
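The same loop, written out as runnable (if naive) Python; a plain dict stands in for the fixed-memory probabilistic counting table used in practice, and the constants are illustrative.

from statistics import median

K, CUTOFF = 20, 20       # assumed k-mer size and coverage cutoff
counts = {}              # in practice: a fixed-size probabilistic counting table

def kmers(seq, k=K):
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

def normalize(reads):
    """Yield only reads whose estimated coverage is still below CUTOFF."""
    for read in reads:
        kms = kmers(read)
        if kms and median(counts.get(km, 0) for km in kms) < CUTOFF:
            for km in kms:
                counts[km] = counts.get(km, 0) + 1
            yield read
        # else: discard read -- its locus is already well covered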
24. Other key points
Virtually identical contig assembly; scaffolding works but is not yet cookie-cutter.
Digital normalization changes the way de Bruijn graph assembly scales: from the size of your data set to the size of the source sample.
Always lower memory than assembly: we never collect most erroneous k-mers.
Digital normalization can be done once – and then assembly parameter exploration can be done.
25. Quotable quotes.
Comment: “This looks like a great solution for people who can’t afford real computers”.
OK, but:
“Buying ever bigger computers is a great solution for people who don’t want to think hard.”
To be less snide: both kinds of scaling are needed, of course.
26. Why use diginorm?
Use the cloud to assemble any microbial genome (incl. single-cell), many eukaryotic genomes, most mRNAseq, and many metagenomes.
Seems to provide leverage on addressing many biological or sample prep problems (single-cell & genome amplification / MDA; metagenomes; heterozygosity).
And, well, the general idea of locus-specific graph analysis solves lots of things…
27. Some interim concluding thoughts
Digital normalization-like approaches provide a path to solving the majority of assembly scaling problems, and will enable assembly on current cloud computing hardware.
This is not true for highly diverse metagenome environments…
For soil, we estimate that we need 50 Tbp / gram soil. Sigh.
Biologists and bioinformaticians hate:
Throwing away data
Caveats in bioinformatics papers (which reviewers like, note)
Digital normalization also discards abundance information.
28. Evaluating sensitivity & specificity
(Pipeline: E. coli @ 10x + soil -> digital normalization -> partitioning -> Velvet, k from 19-51 -> minimus2 merge + other filters -> 98.5% of E. coli recovered.)
29. Example
Dethlefsen shotgun data set / Relman lab
251 million reads / 16 GB gzipped FASTQ
~24 hrs, < 32 GB of RAM for the full pipeline (reads => final assembly + mapping) -- $24 on Amazon EC2
Assembly stats:
58,224 contigs > 1000 bp (average 3 kb), summing to 190 Mbp of genomic sequence
~38 microbial genomes’ worth of DNA
~65% of reads mapped back to the assembly
30. What do we get for soil?
Total Assembly | Total Contigs | % Reads Assembled | Predicted protein coding | rplB genes
2.5 bill | 4.5 mill | 19% | 5.3 mill | 391
3.5 bill | 5.9 mill | 22% | 6.8 mill | 466
(The rplB gene count estimates the number of species.)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes (human genome: ~3 billion bp).
Adina Howe
34. How many soil samples do we need to sequence??
Overlap between Iowa prairie & Iowa corn is significant! (Cumulative)
Adina Howe
35. Extracting whole genomes?
So far, we have only assembled contigs, not whole genomes.
Can entire genomes be assembled from metagenomic data?
Iverson et al. (2012), from the Armbrust lab, contains a technique for scaffolding metagenome contigs into ~whole genomes. YES.
36. Perspective: the coming infopocalypse
Assembling about $20k worth of data, we can generate approximately 700 microbial genomes’ worth of data.
(This is only going to go up in yield/$$, note.)
Most of these assembled genomic contigs (and genes) do not belong to studied organisms.
What the heck do they do??
37. More thoughts on assembly
Illumina is the only game in town for sequencing complex microbial populations, but dealing with the data (volume, errors) is tricky. This problem is being solved, by us and others.
We’re working to make it as close to push-button as possible, with objectively argued parameters and tools, and methods for evaluating new tools and sequencing types.
The community is working on dealing with data downstream of sequencing & assembly.
Most pipelines were built around 454 data – long reads, and relatively few of them.
With Illumina, we can get both long contigs and quantitative information about their abundance. This necessitates changes to pipelines like MG-RAST and HUMAnN.
38. The interpretation challenge
For soil, we have generated approximately 1200 bacterial genomes’ worth of assembled genomic DNA from two soil samples.
The vast majority of this genomic DNA contains unknown genes with largely unknown function.
Most annotations of gene function & interaction are from a few phylogenetically limited model organisms.
An estimated 98% of annotations are computationally inferred: transferred from model organisms to genomic sequence, using homology.
Can these annotations be transferred? (Probably not.)
This will be the biggest sequence analysis challenge of the next 50 years.
39. Concluding thoughts on “assembly”
We can handle all the data (modulo another year or so of engineering). Bring it on!
Our approaches let us (& you) assemble pretty much anything, much more easily than before. (Single cell, microbial genomes, transcriptomes, eukaryotic genomes, metagenomes, BAC sequencing…)
Seriously. No more problemo. Done. Finished. Kaput.
So now what?
Validation.
Interpretation and building general tools.
Interpretation relies on annotation… (Uh oh.)
40. What are future needs?
High-quality, medium+ throughput annotation of genomes?
Extrapolating from model organisms is both immensely important and yet lacking.
Strong phylogenetic sampling bias in existing annotations.
Synthetic biology for investigating non-model organisms?
(Cleverness in experimental biology doesn’t scale.)
Integration of microbiology, community ecology/evolution modeling, and data analysis.
41. Replication fu
In December 2011, I met Wes McKinney on a train and he convinced me that I should look at IPython Notebook.
This is an interactive Web notebook for data analysis…
Hey, neat! We can use this for replication!
All of our figures can be regenerated from scratch, on an EC2 instance, using a Makefile (data pipeline) and IPython Notebook (figure generation).
Everything is version controlled.
Honestly not much work, and it will be less the next time.
43. So… how’d that go?
People who already cared thought it was nifty.
http://ivory.idyll.org/blog/replication-i.html
Almost nobody else cares ;(
Presub enquiry to editor: “Be sure that your paper can be reproduced.” Uh, please read my letter to the end?
“Could you improve your Makefile? I want to reimplement diginorm in another language and reuse your pipeline, but your Makefile is a mess.”
Incredibly useful, nonetheless. Already part of undergraduate and graduate training in my lab; helping us and others with the next papers; etc. etc. etc.
Life is way too short to waste on unnecessarily replicating your own workflows, much less other people’s.
46. Current research in my lab
Solving the rest of your problems
Preliminary functional analysis
47. Searching for the SSU rRNA gene in Illumina data
1. Randomly sequenced ~100 bp DNA fragments from microbial genomes;
2. Everything is sequenced;
3. Not limited by primers or PCR bias;
4. Data mining is the challenge.
(Figure: expected number of SSU rRNA gene fragments as a function of the number of reads, given SSU rRNA gene length ~10^3 bp and genome length ~10^6-10^7 bp.)
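As a back-of-the-envelope version of that expectation (all numbers below are illustrative assumptions, and read-length edge effects are ignored):

# Expected reads overlapping the SSU rRNA gene, assuming uniform shotgun coverage.
gene_len = 1542          # SSU rRNA gene length (bp), per the E. coli example
genome_len = 4.6e6       # assumed genome length (bp), roughly E. coli-sized
n_reads = 1e7            # assumed number of Illumina reads

expected = n_reads * gene_len / genome_len
print(f"~{expected:.0f} reads expected to touch the SSU rRNA gene")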
49. (Figure: the 1542 bp SSU rRNA gene, with forward and reverse primer regions marked; Start: 907, End: 1402.)
Sequence logo of short reads at the forward primer region: AAACTYAAAKGAATTGACGG (current forward primer)
Sequence logo of short reads at the reverse primer region: GYACACACCGCCCGT (current reverse primer, reverse complement)
Primers used in 454 Titanium sequencing of the SSU rRNA gene, using E. coli as an example. Consensus sequences of the primer regions from Illumina reads suggest 1) the searching method is good and 2) primer bias is minimal at the current E-value cutoff.
51. Running HMMs over de Bruijn graphs (=> cross validation)
hmmgs: Assemble based on good-scoring HMM paths through the graph.
Independent of other assemblers; very sensitive, specific.
95% of hmmgs rplB domains are present in our partitioned assemblies.
Jordan Fish, Qiong Wang, and Jim Cole (RDP)
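hmmgs itself is RDP software; the toy sketch below only illustrates the underlying idea, walking a de Bruijn graph one base at a time and pruning paths whose score against a position-specific score matrix (a stand-in for a real profile HMM) drops below a threshold. All names, scores, and thresholds here are made up for illustration.

K = 5  # toy k-mer size

def build_graph(reads, k=K):
    """de Bruijn graph as: k-mer -> set of k-mers that follow it in some read."""
    graph = {}
    for r in reads:
        for i in range(len(r) - k):
            graph.setdefault(r[i:i+k], set()).add(r[i+1:i+k+1])
    return graph

def model_guided_paths(graph, seed, pssm, min_score=0.0):
    """Return (sequence, score) pairs that reach the end of the model.

    pssm is a list of {base: score} dicts, one per model column; the seed
    k-mer is assumed to align to the first K columns."""
    target_len = len(pssm)
    results = []
    stack = [(seed, sum(pssm[i].get(b, -2.0) for i, b in enumerate(seed)))]
    while stack:
        path, score = stack.pop()
        if score < min_score:
            continue                          # prune poorly scoring paths
        if len(path) >= target_len:
            results.append((path, score))
            continue
        for nxt in graph.get(path[-K:], ()):  # extend by one base at a time
            base = nxt[-1]
            stack.append((path + base,
                          score + pssm[len(path)].get(base, -2.0)))
    return results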
52. Streaming error correction.
(Two-pass flow. First pass, over all reads: does the read come from a high-coverage locus? Yes: error-correct low-abundance k-mers in the read. No: add the read to the graph and save it for later. Second pass, over only the saved reads: does the read now come from a high-coverage locus? Yes: error-correct low-abundance k-mers in the read. No: leave it unchanged.)
We can do error trimming of genomic, MDA, transcriptomic, and metagenomic data in < 2 passes, fixed memory.
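A rough sketch of that two-pass logic, assuming the median k-mer count as the coverage estimate and simple truncation at the first low-abundance k-mer in place of full correction; helper names and parameters are illustrative, not the actual khmer API.

from statistics import median

K, COVERAGE, ABUND_CUTOFF = 20, 20, 2
counts, deferred = {}, []

def kmers(seq):
    return [seq[i:i+K] for i in range(len(seq) - K + 1)]

def high_coverage(read):
    kms = kmers(read)
    return bool(kms) and median(counts.get(km, 0) for km in kms) >= COVERAGE

def trim(read):
    """Truncate the read just before the last base of its first low-abundance k-mer."""
    for i, km in enumerate(kmers(read)):
        if counts.get(km, 0) < ABUND_CUTOFF:
            return read[:i + K - 1]
    return read

def first_pass(reads):
    for read in reads:
        if high_coverage(read):
            yield trim(read)          # high-coverage locus: trim errors now
        else:
            for km in kmers(read):    # not enough coverage yet: count & defer
                counts[km] = counts.get(km, 0) + 1
            deferred.append(read)

def second_pass():
    for read in deferred:             # only the saved reads get a second look
        yield trim(read) if high_coverage(read) else read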
We have just submitted a proposal to adapt Euler- or Quake-like error correction (e.g. the spectral alignment problem) to this approach.
54. Side note: error correction is the biggest “data” problem left in sequencing.
Both for mapping & assembly.
55. (Supplemental figure: the 1542 bp SSU rRNA gene with the forward primer region marked; Start: 907, End: 1402.)
Consensus of short reads at the forward primer region: AAACTYAAAKGAATTGACGG (current forward primer)
Figure. Primers used in 454 Titanium sequencing of the 16S rRNA gene, using E. coli as an example. Consensus sequences of the primer region from Illumina reads suggest primer bias is minimal at the current E-value cutoff.
56. Supplemental: abundance filtering is very lossy.
(Figure: percent loss from abundance filtering (all >= 2), in both contigs and bp, for the largest partition, the 8.2x partition, the 3.8x partition, and in total; percentage lost shown on a 0-100 scale.)