2. Warnings
This talk contains forward-looking statements. These forward-looking statements can be identified by terminology such as “will”, “expects”, and “believes”.
-- Safe Harbor provisions of the U.S. Private Securities Litigation Reform Act
“Making predictions is difficult, especially if
they’re about the future.”
-- Attributed to Niels Bohr
5. 2. Big data sets require big machines
For even relatively small data sets, metagenomic assemblers scale
poorly.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
Size of data set == big!!
(Estimated 6 weeks x 3 TB of RAM to assemble a 300 Gbp soil sample with a slightly modified conventional assembler.)
6. Soil is full of uncultured microbes
Randy Jackson
8. Great Prairie sampling design
(Figure: sampling design diagram showing a reference core plus spatial samples at 1 cm, 1 m, and 10 m spacings.)
Soil cores: 1 inch diameter, 4 inches deep (litter and roots removed)
• Spatial samples: 16S rRNA, nifH
• Reference sample sequenced (small unmixed sample)
• Reference bulk soil: stored for additional “omics” and metadata
9. Soil contains thousands to millions of species
(Figure: “collector’s curves” of ~species. Number of OTUs, 0-2000, plotted against number of sequences, 100-8100, for Iowa Corn, Iowa Native Prairie, Kansas Corn, Kansas Native Prairie, Wisconsin Corn, Wisconsin Native Prairie, Wisconsin Restored Prairie, and Wisconsin Switchgrass.)
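For context, a minimal sketch of how such a collector's (rarefaction) curve can be computed, assuming one OTU label per 16S sequence; the function and variable names are illustrative, not from any particular package.

import random

def collectors_curve(otu_assignments, step=500, trials=10):
    """Mean number of distinct OTUs observed vs. number of sequences sampled."""
    points = []
    for n in range(step, len(otu_assignments) + 1, step):
        observed = [len(set(random.sample(otu_assignments, n)))
                    for _ in range(trials)]
        points.append((n, sum(observed) / trials))
    return points

# Hypothetical usage: one OTU label per sequence from a sample.
# curve = collectors_curve(["OTU_17", "OTU_3", "OTU_17", ...])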
10. The set of questions for soil -- discovery
What’s there?
Is it really that complex a community?
How “deep” do we need to sequence to sample thoroughly and
systematically?
What organisms and gene functions are present, including non-canonical carbon and nitrogen cycling pathways?
What kind of organismal and functional overlap is there between different sites? (Total sampling needed?)
How is ecological complexity created & maintained?
How does ecological complexity respond to perturbation?
11. Why are we applying short-read sequencing to this problem!?
Short-read sampling is deep and quantitative.
Statistical argument: your ability to observe rare organisms – your sensitivity of measurement – is directly related to the number of independent sequences you take.
Longer reads (PacBio, 454, Ion Torrent) are less informative.
The majority of metagenome studies going forward will make use of Illumina.
BUT this kind of sequence is challenging to analyze.
BUT, BUT this kind of sequence is necessary for high-complexity environments.
12. Challenges of short-read analysis
Low signal for functional analysis; no linkage at all.
High error rates.
Massive volume.
Rapidly changing technology.
There are several approaches, but we have settled on assembly.
14. Approach 1: Partitioning
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
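A minimal sketch of the connectivity idea, assuming reads that share any k-mer belong in the same partition. This toy union-find version keeps everything in ordinary Python dicts; the real implementation uses a compact probabilistic graph representation to reach the memory savings described on the next slide.

K = 31  # assumed k-mer size for connectivity

def kmers(seq, k=K):
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

def partition(reads):
    """Group read indices into connected components that share k-mers."""
    parent = list(range(len(reads)))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_seen = {}                   # k-mer -> index of first read containing it
    for idx, read in enumerate(reads):
        for km in kmers(read):
            if km in first_seen:
                union(idx, first_seen[km])
            else:
                first_seen[km] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())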
15. Partitioning for scaling
Can be done in ~10x less memory than assembly.
Partition at low k and assemble exactly at any higher k (DBG).
Partitions can then be assembled independently
Multiple processors -> scaling
Multiple k, coverage -> improved assembly
Multiple assembly packages (tailored to high variation, etc.)
Can eliminate small partitions/contigs in the partitioning phase.
An incredibly convenient strategy, enabling divide & conquer approaches across the board.
16. Technical challenges met (and defeated)
Novel data structure properties elucidated via percolation theory analysis (Pell et al., PNAS, 2012).
Exhaustive in-memory traversal of graphs containing 5-15 billion nodes.
Sequencing technology introduces false connections in the graph (Howe et al., in prep.)
Only 20x improvement in assembly scaling.
17. (NOVEL)
Approach 2: Digital normalization
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
18. Digital normalization discards redundant reads prior to assembly.
This removes reads and decreases data size, eliminates errors from removed reads, and normalizes coverage across loci.
19. Digital normalization algorithm
for read in dataset:
    if median_kmer_count(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read
Note: single pass; fixed memory.
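The same loop, written out as runnable (if naive) Python; a plain dict stands in for the fixed-memory probabilistic counting table used in practice, and the constants are illustrative.

from statistics import median

K, CUTOFF = 20, 20       # assumed k-mer size and coverage cutoff
counts = {}              # in practice: a fixed-size probabilistic counting table

def kmers(seq, k=K):
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

def normalize(reads):
    """Yield only reads whose estimated coverage is still below CUTOFF."""
    for read in reads:
        kms = kmers(read)
        if kms and median(counts.get(km, 0) for km in kms) < CUTOFF:
            for km in kms:
                counts[km] = counts.get(km, 0) + 1
            yield read
        # else: discard read -- its locus is already well covered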
24. Other key points
Virtually identical contig assembly; scaffolding works but is not yet cookie-cutter.
Digital normalization changes the way de Bruijn graph assembly scales: from the size of your data set to the size of the source sample.
Always lower memory than assembly: we never collect most erroneous k-mers.
Digital normalization can be done once – and then assembly parameter exploration can be done.
25. Quotable quotes.
Comment: “This looks like a great solution for people who can’t afford real computers”.
OK, but:
“Buying ever bigger computers is a great solution for people who don’t want to think hard.”
To be less snide: both kinds of scaling are needed, of course.
26. Why use diginorm?
Use the cloud to assemble any microbial genome (incl. single-cell), many eukaryotic genomes, most mRNAseq, and many metagenomes.
Seems to provide leverage on addressing many biological or sample prep problems (single-cell & genome amplification / MDA; metagenomes; heterozygosity).
And, well, the general idea of locus-specific graph analysis solves lots of things…
27. Some interim concluding thoughts
Digital normalization-like approaches provide a path to solving the majority of assembly scaling problems, and will enable assembly on current cloud computing hardware.
This is not true for highly diverse metagenome environments…
For soil, we estimate that we need 50 Tbp / gram soil. Sigh.
Biologists and bioinformaticians hate:
Throwing away data
Caveats in bioinformatics papers (which reviewers like, note)
Digital normalization also discards abundance information.
28. Evaluating sensitivity & specificity
(Pipeline: E. coli @ 10x + soil -> digital normalization -> partitioning -> Velvet, k from 19-51 -> minimus2 merge + other filters -> 98.5% of E. coli recovered.)
29. Example
Dethlefsen shotgun data set / Relman lab
251 million reads / 16 GB gzipped FASTQ
~24 hrs, < 32 GB of RAM for the full pipeline (reads => final assembly + mapping) -- $24 on Amazon EC2
Assembly stats:
58,224 contigs > 1000 bp (average 3 kb), summing to 190 Mbp of genomic sequence
~38 microbial genomes’ worth of DNA
~65% of reads mapped back to the assembly
30. What do we get for soil?
Total Assembly | Total Contigs | % Reads Assembled | Predicted protein coding | rplB genes
2.5 bill | 4.5 mill | 19% | 5.3 mill | 391
3.5 bill | 5.9 mill | 22% | 6.8 mill | 466
(The rplB gene count estimates the number of species.)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes (human genome: ~3 billion bp).
Adina Howe
34. How many soil samples do we need to sequence??
Overlap between Iowa prairie & Iowa corn is significant! (Cumulative)
Adina Howe
35. Extracting whole genomes?
So far, we have only assembled contigs, not whole genomes.
Can entire genomes be assembled from metagenomic data?
Iverson et al. (2012), from the Armbrust lab, contains a technique for scaffolding metagenome contigs into ~whole genomes. YES.
36. Perspective: the coming infopocalypse
Assembling about $20k worth of data, we can generate approximately 700 microbial genomes’ worth of data.
(This is only going to go up in yield/$$, note.)
Most of these assembled genomic contigs (and genes) do not belong to studied organisms.
What the heck do they do??
37. More thoughts on assembly
Illumina is the only game in town for sequencing complex microbial populations, but dealing with the data (volume, errors) is tricky. This problem is being solved, by us and others.
We’re working to make it as close to push-button as possible, with objectively argued parameters and tools, and methods for evaluating new tools and sequencing types.
The community is working on dealing with data downstream of sequencing & assembly.
Most pipelines were built around 454 data – long reads, and relatively few of them.
With Illumina, we can get both long contigs and quantitative information about their abundance. This necessitates changes to pipelines like MG-RAST and HUMAnN.
38. The interpretation challenge
For soil, we have generated approximately 1200 bacterial genomes’ worth of assembled genomic DNA from two soil samples.
The vast majority of this genomic DNA contains unknown genes with largely unknown function.
Most annotations of gene function & interaction are from a few phylogenetically limited model organisms.
An estimated 98% of annotations are computationally inferred: transferred from model organisms to genomic sequence, using homology.
Can these annotations be transferred? (Probably not.)
This will be the biggest sequence analysis challenge of the next 50 years.
39. Concluding thoughts on “assembly”
We can handle all the data (modulo another year or so of engineering). Bring it on!
Our approaches let us (& you) assemble pretty much anything, much more easily than before. (Single cell, microbial genomes, transcriptomes, eukaryotic genomes, metagenomes, BAC sequencing…)
Seriously. No more problemo. Done. Finished. Kaput.
So now what?
Validation.
Interpretation and building general tools.
Interpretation relies on annotation… (Uh oh.)
40. What are future needs?
High-quality, medium+ throughput annotation of genomes?
Extrapolating from model organisms is both immensely important and yet lacking.
Strong phylogenetic sampling bias in existing annotations.
Synthetic biology for investigating non-model organisms?
(Cleverness in experimental biology doesn’t scale.)
Integration of microbiology, community ecology/evolution modeling, and data analysis.
41. Replication fu
In December 2011, I met Wes McKinney on a train and he convinced me that I should look at IPython Notebook.
This is an interactive Web notebook for data analysis…
Hey, neat! We can use this for replication!
All of our figures can be regenerated from scratch, on an EC2 instance, using a Makefile (data pipeline) and IPython Notebook (figure generation).
Everything is version controlled.
Honestly not much work, and it will be less the next time.
43. So… how’d that go?
People who already cared thought it was nifty.
http://ivory.idyll.org/blog/replication-i.html
Almost nobody else cares ;(
Presub enquiry to editor: “Be sure that your paper can be reproduced.” Uh, please read my letter to the end?
“Could you improve your Makefile? I want to reimplement diginorm in another language and reuse your pipeline, but your Makefile is a mess.”
Incredibly useful, nonetheless. Already part of undergraduate and graduate training in my lab; helping us and others with the next papers; etc. etc. etc.
Life is way too short to waste on unnecessarily replicating your own workflows, much less other people’s.
46. Current research in my lab
Solving the rest of your problems
Preliminary functional analysis
47. Searching for the SSU rRNA gene in Illumina data
1. Randomly sequenced ~100 bp DNA fragments from microbial genomes;
2. Everything is sequenced;
3. Not limited by primers or PCR bias;
4. Data mining is the challenge.
(Figure: expected number of SSU rRNA gene fragments as a function of the number of reads, given SSU rRNA gene length ~10^3 bp and genome length ~10^6-10^7 bp.)
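As a back-of-the-envelope version of that expectation (all numbers below are illustrative assumptions, and read-length edge effects are ignored):

# Expected reads overlapping the SSU rRNA gene, assuming uniform shotgun coverage.
gene_len = 1542          # SSU rRNA gene length (bp), per the E. coli example
genome_len = 4.6e6       # assumed genome length (bp), roughly E. coli-sized
n_reads = 1e7            # assumed number of Illumina reads

expected = n_reads * gene_len / genome_len
print(f"~{expected:.0f} reads expected to touch the SSU rRNA gene")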
49. (Figure: the 1542 bp SSU rRNA gene, with forward and reverse primer regions marked; Start: 907, End: 1402.)
Sequence logo of short reads at the forward primer region: AAACTYAAAKGAATTGACGG (current forward primer)
Sequence logo of short reads at the reverse primer region: GYACACACCGCCCGT (current reverse primer, reverse complement)
Primers used in 454 Titanium sequencing of the SSU rRNA gene, using E. coli as an example. Consensus sequences of the primer regions from Illumina reads suggest 1) the searching method is good and 2) primer bias is minimal at the current E-value cutoff.
51. Running HMMs over de Bruijn graphs (=> cross validation)
hmmgs: Assemble based on good-scoring HMM paths through the graph.
Independent of other assemblers; very sensitive, specific.
95% of hmmgs rplB domains are present in our partitioned assemblies.
Jordan Fish, Qiong Wang, and Jim Cole (RDP)
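hmmgs itself is RDP software; the toy sketch below only illustrates the underlying idea, walking a de Bruijn graph one base at a time and pruning paths whose score against a position-specific score matrix (a stand-in for a real profile HMM) drops below a threshold. All names, scores, and thresholds here are made up for illustration.

K = 5  # toy k-mer size

def build_graph(reads, k=K):
    """de Bruijn graph as: k-mer -> set of k-mers that follow it in some read."""
    graph = {}
    for r in reads:
        for i in range(len(r) - k):
            graph.setdefault(r[i:i+k], set()).add(r[i+1:i+k+1])
    return graph

def model_guided_paths(graph, seed, pssm, min_score=0.0):
    """Return (sequence, score) pairs that reach the end of the model.

    pssm is a list of {base: score} dicts, one per model column; the seed
    k-mer is assumed to align to the first K columns."""
    target_len = len(pssm)
    results = []
    stack = [(seed, sum(pssm[i].get(b, -2.0) for i, b in enumerate(seed)))]
    while stack:
        path, score = stack.pop()
        if score < min_score:
            continue                          # prune poorly scoring paths
        if len(path) >= target_len:
            results.append((path, score))
            continue
        for nxt in graph.get(path[-K:], ()):  # extend by one base at a time
            base = nxt[-1]
            stack.append((path + base,
                          score + pssm[len(path)].get(base, -2.0)))
    return results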
52. Streaming error correction.
(Two-pass flow. First pass, over all reads: does the read come from a high-coverage locus? Yes: error-correct low-abundance k-mers in the read. No: add the read to the graph and save it for later. Second pass, over only the saved reads: does the read now come from a high-coverage locus? Yes: error-correct low-abundance k-mers in the read. No: leave it unchanged.)
We can do error trimming of genomic, MDA, transcriptomic, and metagenomic data in < 2 passes, fixed memory.
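A rough sketch of that two-pass logic, assuming the median k-mer count as the coverage estimate and simple truncation at the first low-abundance k-mer in place of full correction; helper names and parameters are illustrative, not the actual khmer API.

from statistics import median

K, COVERAGE, ABUND_CUTOFF = 20, 20, 2
counts, deferred = {}, []

def kmers(seq):
    return [seq[i:i+K] for i in range(len(seq) - K + 1)]

def high_coverage(read):
    kms = kmers(read)
    return bool(kms) and median(counts.get(km, 0) for km in kms) >= COVERAGE

def trim(read):
    """Truncate the read just before the last base of its first low-abundance k-mer."""
    for i, km in enumerate(kmers(read)):
        if counts.get(km, 0) < ABUND_CUTOFF:
            return read[:i + K - 1]
    return read

def first_pass(reads):
    for read in reads:
        if high_coverage(read):
            yield trim(read)          # high-coverage locus: trim errors now
        else:
            for km in kmers(read):    # not enough coverage yet: count & defer
                counts[km] = counts.get(km, 0) + 1
            deferred.append(read)

def second_pass():
    for read in deferred:             # only the saved reads get a second look
        yield trim(read) if high_coverage(read) else read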
We have just submitted a proposal to adapt Euler- or Quake-like error correction (e.g. the spectral alignment problem) to this approach.
54. Side note: error correction is the biggest “data” problem left in sequencing.
Both for mapping & assembly.
55. (Supplemental figure: the 1542 bp SSU rRNA gene with the forward primer region marked; Start: 907, End: 1402.)
Consensus of short reads at the forward primer region: AAACTYAAAKGAATTGACGG (current forward primer)
Figure. Primers used in 454 Titanium sequencing of the 16S rRNA gene, using E. coli as an example. Consensus sequences of the primer region from Illumina reads suggest primer bias is minimal at the current E-value cutoff.
56. Supplemental: abundance filtering is very lossy.
(Figure: percent loss from abundance filtering (all >= 2), in both contigs and bp, for the largest partition, the 8.2x partition, the 3.8x partition, and in total; percentage lost shown on a 0-100 scale.)