2014 marine-microbes-grc
1. Assembling diverse & rich
metagenomes: the secrets of
the ancients.
C. Titus Brown
ctb@msu.edu
2. Introducing myself --
ged.msu.edu/
“Data-intensive biology” – tools, etc.
Not a marine microbiologist at all!
Note: these slides are all on slideshare.
(Google “titus brown slide share”)
3. My goals
Enable hypothesis-driven biology
through better hypothesis generation
& refinement.
Devalue “interest level” of sequence
analysis and put myself out of a job.
Be a good mutualist!
4. Part I: Soil Assembly & the
Great Prairie Grand
Challenge
2008
5. Soil microbial ecology -
questions
What ecosystem level functions are present,
and how do microbes do them?
How does agricultural soil differ from native
soil?
How does soil respond to climate
perturbation?
Questions that are not easy to answer
without shotgun sequencing:
◦ What kind of strain-level heterogeneity is present
in the population?
◦ What does the phage and viral population look
like?
◦ What species are where?
7. Approach – assemble into
contigs.
We found that short reads from
phylogenetically distant and
microbially diverse environments
could not be reliably annotated.
=> Build into longer contigs first.
…5 year odyssey…
8. (Friends don’t let friends BLAST short
reads.**)
** Applicable to most environmental samples. Howe et al., 2014
9. Developed two new methods
--
I. Computational “cell sorting”
II. Computational “library
normalization.”
See:
• Pell et al., Tiedje, Brown (2012);
• Howe et al., Tiedje, Brown (2014);
• Goffredi et al. (2014)
16. Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Result: we (easily, casually) assembled
two of the biggest metagenomes ever.
Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding
2.5 bill         4.5 mill                   19%                 5.3 mill
3.5 bill         5.9 mill                   22%                 6.8 mill
Howe et al, 2014; pmid 24632729
(I’ll come back to this)
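The slide does not show how the ~1200 figure was reached. One reading that fits the table, assuming an average bacterial genome of about 5 Mbp (an illustrative assumption, not a number from the talk), can be checked with a few lines of arithmetic:

```python
# Back-of-envelope check of "~1200 bacterial genomes".
# AVG_BACTERIAL_GENOME_BP is an assumed illustrative value, not from the slides.
AVG_BACTERIAL_GENOME_BP = 5e6
HUMAN_GENOME_BP = 3e9  # "~3 billion bp", as stated on the slide

# "Total Assembly" column of the table, in bp.
assembly_sizes_bp = [2.5e9, 3.5e9]

total_bp = sum(assembly_sizes_bp)
genome_equivalents = total_bp / AVG_BACTERIAL_GENOME_BP
human_equivalents = total_bp / HUMAN_GENOME_BP

print(f"{genome_equivalents:.0f} bacterial genome equivalents")
print(f"{human_equivalents:.1f} human genome equivalents")
```

Under that assumption the two assemblies together come to roughly twice the size of the human genome.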
17. So…
We can now achieve an assembly of
pretty much anything (soil was really
hard, virtually everything else is easier!)
Lots of people are interested in
collaborating with us on this!
…but we regard it as a
largely solved problem.
18. I: assembly “protocols”
khmer-protocols: open, versioned, citable,
forkable set of instructions to assemble euk
mRNAseq and metagenomes on widely
accessible compute resources.
Explicit command-line instructions to go from
raw reads to annotated “final product”.
For mRNAseq: ~$150/compute for $2000 of
data.
(Still in beta, note.)
20. Example - Deep Carbon data
set
Masimong Gold Mine; microbial cells
filtered from fracture water from within
a 1.9km borehole. (32,000 year old
water)
5.6m reads, 601.3 Mbp;
◦ computational protocol took 4 hours;
◦ Assembled to 56 Mbp > 300 bp
◦ longest contig is 73kb
◦ 70% of paired-end reads mapped.
w/M.C.Y. Lau, Tullis Onstott
21. Our (open) approach:
If the protocols work for you, great! Cite
us.
If the protocols don’t work for you, please
let us know so we can fix them.
If it’s a challenging problem, we’d love
to collaborate.
We are also happy to help train people.
22. Things we no longer worry about
(much) – let’s chat:
Inter-species assembly chimerae
…apart from w/in strain variants, chimerae
are hard to form with contig assembly.
Finding homology matches in metagenomes
…contigs give as good a
match as possible.
Assembling contigs when we have sufficient
coverage
…not enough coverage is
usually the problem.
23. II: Shotgun sequencing and
coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the
top through all of the reads.
24. Random sampling => deep sampling
needed
Typically 10-100x needed for robust recovery (300 Gbp for human)
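The coverage arithmetic behind these numbers is short enough to write down. The sketch below is only an illustration; the function names are made up for this example, and the genome size is the ~3 billion bp figure quoted elsewhere in these slides.

```python
def coverage(total_bases_sequenced: float, genome_size_bp: float) -> float:
    """Average number of reads overlapping each base of the genome."""
    return total_bases_sequenced / genome_size_bp


def bases_needed(target_coverage: float, genome_size_bp: float) -> float:
    """Total bases to sequence to reach a target average coverage."""
    return target_coverage * genome_size_bp


HUMAN_GENOME_BP = 3e9  # ~3 billion bp

# 100x coverage of a human genome is 300 Gbp of sequence.
print(bases_needed(100, HUMAN_GENOME_BP) / 1e9, "Gbp")
```

Because sampling is random, this is an average: some bases will be covered far less often than the mean, which is why deep sampling is needed.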
26. Downstream goals of
assembly:
(Even assuming ribotyping works perfectly)
Annotate genes with higher confidence.
Reconstruct operons & ultimately even
full genomes.
Analyze strain variation.
Study organisms that ribotyping can’t
(phage & virus)
27. Main questions --
I. How do we know if we’ve sequenced
enough?
II. Can we predict how much more we
need to sequence to see <insert
some feature here>?
Note: necessary sequencing depth cannot
accurately be predicted from SSU/amplicon
data
28. Method 1: looking for WGS
saturation
We can track how many sequences we
keep of the sequences we’ve seen, to
detect saturation.
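One way to read "keep" here is as a normalization-style filter: a read is kept only if its estimated coverage so far is below a cutoff C. The sketch below makes that assumption and uses an exact in-memory k-mer counter for clarity. It is a toy illustration, not the khmer implementation, and the names `kmers` and `keep_curve` are made up for this example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def keep_curve(reads, k=20, cutoff=10):
    """Cumulative number of reads kept after each read is seen.

    A read is kept when the median count of its k-mers, among the reads
    kept so far, is below `cutoff`. A curve that flattens out suggests
    that new reads are no longer adding new sequence (saturation).
    """
    counts = Counter()
    kept = 0
    curve = []
    for read in reads:
        ks = kmers(read, k)
        # Counter returns 0 for k-mers that have not been seen yet.
        if ks and median(counts[km] for km in ks) < cutoff:
            counts.update(ks)
            kept += 1
        curve.append(kept)
    return curve


# Toy input: the same read 30 times. Only the first 10 copies are kept.
curve = keep_curve(["ACGTTGCAAG"] * 30, k=4, cutoff=10)
print(curve[-1], "of", len(curve), "reads kept")
```

Plotting reads kept against reads seen gives the saturation curve: it rises steeply while the sample is still yielding new sequence and levels off once most reads are redundant.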
29. Data from Shakya et al., 2013 (pmid: 23387867)
We can detect saturation of
shotgun sequencing
30. Data from Shakya et al., 2013 (pmid: 23387867)
We can detect saturation of
shotgun sequencing
C=10, for assembly
31. Estimating metagenome nt
richness:
# bp at saturation / coverage
MM5 deep carbon: 60 Mbp
Iowa prairie soil: 12 Gbp
Amazon Rain Forest Microbial
Observatory soil: 26 Gbp
Assumes: few entirely erroneous reads (upper
bound); at saturation (lower bound).
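The estimate above is a single division. In the sketch below the function name and the input numbers are hypothetical, chosen only to show the arithmetic; they are not measurements from any of the data sets listed.

```python
def nt_richness(bp_kept_at_saturation: float, coverage_cutoff: float) -> float:
    """Estimated nucleotide richness: bp retained at saturation / coverage."""
    if coverage_cutoff <= 0:
        raise ValueError("coverage_cutoff must be positive")
    return bp_kept_at_saturation / coverage_cutoff


# Hypothetical example: 1.2 Gbp retained at a coverage cutoff of C=10.
estimate_bp = nt_richness(1.2e9, 10)
print(f"{estimate_bp / 1e6:.0f} Mbp")
```

Retained reads that are entirely erroneous inflate the numerator, so the result acts as an upper bound in that respect; if the sample has not reached saturation, the numerator is too small and the result is a lower bound.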
32. WGS saturation approach:
Tells us when we have enough
sequence.
Can’t be predictive… if you haven’t
sampled something, you can’t say
anything about it.
Can we correlate deep amplicon
sequencing with shallower WGS?
33. Correlating 16s and shotgun
seq
Errors do not strongly affect saturation.
How much of 16s do you see… with how much shotgun sequencing
34. Data from Shakya et al., 2013 (pmid: 23387867)
WGS saturation ~matches 16s saturation
< rRNA copy
number >
35. 16s region choice is not significant (?!)
Data from Shakya et al., 2013 (pmid: 23387867)
36. Method is robust to organisms
unsampled by amplicons.
Insensitive to
amplicon primer
bias.
Robust to genome
size differences,
eukaryotes, phage.
Data from Shakya et al., 2013 (pmid: 23387867)
41. Thoughts on 16s/WGS
comparison:
Robust to some real problems (primer
bias; organisms unsampled by
amplicon seq) & insensitive to 16s seq
error.
Hopefully can be used to build a
predictive framework to answer “how
much more sequencing should I do?”
◦ Sensitivity: “What have I missed?”
◦ Planning: “How much $$ should I ask for?”
42. Other things that y’all might be
interested in:
Comparing 16s from amplicon and
shotgun sequencing.
Metatranscriptome assembly protocol
Biogeography of genomic sequence
43. Metatranscriptome assembly
(soil)
                    Total Length (bp)   Total rRNA (bp)          Total annotated by MG-RAST (m5nr SEED)
Unassembled MetaT   20,525,296,600      16,987,863,800 (82.8%)   48,080,200 (0.23%)
Assembled MetaT     32,471,548          7,061,913 (21.8%)        2,075,701 (6.4%)
Aaron Garoutte (w/Tiedje & Howe)
45. Primer bias against
Verrucomicrobia
Check taxonomy of reads causing
mismatch (A)
Verrucomicrobia cause
70% (117/168) of
mismatch
Current primer is not effective at amplifying
Verrucomicrobia
Jaron Guo
47. Biogeography of genomic DNA
(2)
How much genomic richness is shared
between different sites?
Qingpeng Zhang
48. Concluding thoughts
Tools and protocols for data analysis are
fast becoming intrinsic to practice of
biology.
◦ Most tools are wrong, but some are useful.
◦ All of our tools are openly, freely available in
every way possible.
We are trying to make assembly fast,
cheap, easy, and good.
We are building on our assembly-based
approaches & intuition to tackle other
questions.
49. Big Data is neither the real
problem nor the solution.
Dealing with Big Data requires a new
mentality, so training/experience is
probably most effective way forward.
With sequencing, few if any of your
biology problems go away, although
some aspects may become more
tractable.
Think future: any -ome you want from
any sample you can get. …So now
50. Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
We don’t know what most genes do.
Total Assembly   Total Contigs (> 300 bp)   % Reads Assembled   Predicted protein coding
2.5 bill         4.5 mill                   19%                 5.3 mill
3.5 bill         5.9 mill                   22%                 6.8 mill
Howe et al, 2014; pmid 24632729
51. Potential discussion topics
A. Funding and collaboration models.
B. Leveraging data & computation to
help understand gene function.
C. Computational/data infrastructure
…but planning for poverty, not wealth:
sustainability and “bus factor”.
D. Capacity building
Standardized data sets; data availability.
Workshops and training.
52. Training in data analysis et al.
Software Carpentry.
Data Carpentry.
STAMPS, EDAMAME, MSU NGS
course.
<other courses go here>
Editor's notes
Fly-over country (that I live in)
Nothing more frustrating to biologists than having data that you can’t analyze
Est 200 hrs of my effort
~Easy to say how much you need for a single genome.
Note: 16s is higher copy number, more sensitive than WGS.
otu5 is acidobacterium; one species, Acidobacterium capsulatum, with one rRNA; 4.6% of BA community, 4.7% of Illumina reads;
otu2 is chlorobium; five species, total of 10 rRNA; 9.1% of Illumina. Correction factor of 5.
JGI v6, 454 amplicon sequencing
Original motivation was, should we combine samples?