8. Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10: draw a line straight down from the top and count the reads it crosses.
9. Random sampling => deep sampling needed
Typically 10-100x coverage is needed for robust recovery (300 Gbp for human).
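The arithmetic behind these numbers can be sketched directly. The genome size and coverage targets below are the usual round figures, not exact values:

```python
def coverage(n_reads, read_len, genome_size):
    """Coverage C = N * L / G: N reads of length L over a genome of size G."""
    return n_reads * read_len / genome_size

# How much raw sequence does 100x of a human-sized genome require?
human_genome = 3e9                  # ~3 Gbp
total_bases = 100 * human_genome    # 100x coverage
print(total_bases / 1e9, "Gbp")     # 300.0 Gbp
```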
10. An apparent digression:
Much of next-gen sequencing is redundant.
Can we eliminate this redundancy?
17. Basic diginorm algorithm
We can build the approach on anything that lets us estimate the coverage of a read.

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read

Note: single pass; sublinear memory.
18. The median k-mer count in a “sentence” is a ~good estimator of coverage.
This gives us a reference-free measure of coverage.
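Putting the algorithm and the median estimator together, the whole scheme fits in a few lines of Python. This is only an illustrative sketch: real diginorm (khmer) keeps k-mer counts in a fixed-memory probabilistic counting structure rather than an exact Counter, and K and CUTOFF here are arbitrary example values.

```python
from collections import Counter
from statistics import median

K = 20        # k-mer size (example value)
CUTOFF = 20   # target coverage (example value)

kmer_counts = Counter()   # stand-in for khmer's fixed-memory counting structure

def kmers(read):
    return [read[i:i+K] for i in range(len(read) - K + 1)]

def estimated_coverage(read):
    # median k-mer count ~ coverage of the read, with no reference needed
    return median(kmer_counts[k] for k in kmers(read))

def normalize(dataset):
    kept = []
    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            for k in kmers(read):          # update_kmer_counts(read)
                kmer_counts[k] += 1
            kept.append(read)              # save(read)
        # else: discard -- its coverage is already represented
    return kept
```

Because each read is examined once and high-coverage reads never update the counts, the pass is single-pass and the retained data scales with genome size rather than data size.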
28. Contig assembly now scales with underlying genome size
Transcriptomes, microbial genomes (including MDA), and most metagenomes can be assembled in under 50 GB of RAM, with ~identical or improved results.
30. A few “minor” drawbacks…
1. Repeats are eliminated preferentially.
2. Genuine graph tips are truncated.
3. Polyploidy is downsampled.
4. It’s not clear what happens to polymorphism.
(For these reasons, we have been pursuing alternate approaches.)
Partially discussed in Brown et al., 2012 (arXiv)
31. But still quite useful…
1. Assembling soil metagenomes.
Howe et al., PNAS, 2014 (w/Tiedje)
2. Understanding bone-eating worm symbionts.
Goffredi et al., ISME, 2014.
3. An ultra-deep look at the lamprey transcriptome.
Scott et al., in preparation (w/Li)
4. Understanding development in Molgulid ascidians.
Stolfi et al, eLife 2014; etc.
32. …and widely used (?)
Estimated ~1000 users of our software.
The diginorm algorithm is now included in the Trinity software from the Broad Institute (~10,000 users).
Illumina TruSeq long-read technology now incorporates our approach (~100,000 users).
35. Graph saturation

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # high coverage read: do something clever!
36. “Few-pass” approach
By 20% of the way through a 100x data set, more than half of the reads are already saturated to 20x.
37. Graph saturation

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # high coverage read: do something clever!
38. (A) Streaming error detection for metagenomes and transcriptomes
Illumina has an error rate between 0.1% and 1%.
These errors confound mapping, assembly, etc.
(Think: what if you had error-free reads? Life would be much better.)
41. …spectral error detection for reads => transcriptome, metagenome
[Figure: k-mer abundance spectrum separating low-abundance erroneous k-mers from high-abundance true k-mers]
Chaisson et al., 2009
42. Spectral error detection on variable coverage data
How many of the errors can we pinpoint exactly?

                 f saturated   Specificity   Sensitivity
Genome               100%         71.4%         77.9%
Transcriptome         92%         67.7%         63.8%
Metagenome            96%         71.2%         68.9%
Real E. coli         100%         51.1%         72.4%
43. (B) Streaming error trimming for all shotgun data
We can trim reads at the first error.

                 f saturated   error rate   total bases trimmed   errors remaining
Genome               100%         0.63%           31.90%               0.00%
Transcriptome         92%         0.65%           34.34%               0.07%
Metagenome            96%         0.62%           31.70%               0.04%
Real E. coli         100%         1.59%           12.96%               0.05%
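One way to sketch “trim at first error” in Python: walk the read's k-mers left to right and cut just before the first k-mer whose count falls below a solidity threshold. The counting structure, K, and min_count here are illustrative choices; khmer's actual implementation differs in detail.

```python
K = 20   # k-mer size (example value)

def trim_at_first_error(read, kmer_counts, min_count=3):
    """Truncate a read at the first k-mer that looks erroneous.

    Scanning left to right, the first low-abundance k-mer implicates its
    final base as the likely error, so we keep everything before that base.
    """
    for i in range(len(read) - K + 1):
        if kmer_counts.get(read[i:i+K], 0) < min_count:
            return read[:i + K - 1]   # drop the suspect base and everything after
    return read
```

On the clean prefix of a read every k-mer is well supported; at the first error, the k-mer ending in the bad base drops to low abundance and the read is cut there.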
44. (C) Streaming error correction
Once you can do error detection and trimming on a
streaming basis, why not error correction?
…using a new approach…
45. Streaming error correction of genomic, transcriptomic, and metagenomic data via graph alignment
Jason Pell, Jordan Fish, Michael Crusoe
47. …a bit more complex...
Jordan Fish and Michael Crusoe
48. Error correction on simulated E. coli data

                 TP            FP            TN             FN
Streaming     3,494,631       3,865      460,601,171       5,533
             (corrected)   (mistakes)       (OK)          (missed)

1% error rate, 100x coverage.
Michael Crusoe, Jordan Fish, Jason Pell
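As a sanity check, sensitivity and specificity follow directly from the confusion-matrix entries above (standard definitions; the counts are taken from the table):

```python
TP, FP, TN, FN = 3_494_631, 3_865, 460_601_171, 5_533

sensitivity = TP / (TP + FN)   # fraction of true errors that were corrected
specificity = TN / (TN + FP)   # fraction of correct bases left alone

print(f"sensitivity = {sensitivity:.4f}")    # ~0.9984
print(f"specificity = {specificity:.6f}")    # ~0.999992
```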
51. A few additional thoughts --
Sequence-to-graph alignment is a very general concept.
It could replace mapping, variant calling, BLAST, HMMER…
“Ask me for anything but time!”
-- Napoleon Bonaparte
52. (D) Calculating read error rates by position within read
Shotgun data is randomly sampled; any variation in mismatches with the reference by position is likely due to errors or bias.
Pipeline: Reads → Assemble → Map reads to assembly → Calculate position-specific mismatches
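The “calculate position-specific mismatches” step is simple to sketch, assuming gap-free alignments where each read is paired with the reference segment it mapped to (a simplification of what a mapper like bowtie actually reports):

```python
def positional_mismatch_rate(alignments):
    """alignments: iterable of (read, ref) pairs, gap-free, equal length.

    Returns the mismatch rate at each position along the read.
    """
    mismatches, totals = {}, {}
    for read, ref in alignments:
        for pos, (r, t) in enumerate(zip(read, ref)):
            totals[pos] = totals.get(pos, 0) + 1
            if r != t:
                mismatches[pos] = mismatches.get(pos, 0) + 1
    return [mismatches.get(p, 0) / totals[p] for p in sorted(totals)]
```

Any position with an elevated rate relative to its neighbors points to systematic sequencing error or bias at that cycle.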
53. Sequencing run error profiles
Via bowtie mapping against reference --
Reads from Shakya et al., pmid 23387867
54. We can do this sub-linearly from data with no reference!
Reads from Shakya et al., pmid 23387867
55. Reference-free error profile analysis
1. Requires no prior information!
2. Immediate feedback on sequencing quality (for cores & users)
3. Fast, lightweight (~100 MB, ~2 minutes)
4. Works for any shotgun sample (genomic, metagenomic, transcriptomic).
5. Not affected by polymorphisms.
56. Reference-free error profile analysis
7. …if we know where the errors are, we can trim them.
8. …if we know where the errors are, we can correct them.
9. …if we look at differences by graph position instead of by read position, we can call variants.
=> Streaming, online variant calling?
64. Directions for streaming graph analysis
Generate error profiles for shotgun reads;
Variable-coverage error trimming;
Streaming low-memory error correction for genomes, metagenomes, and transcriptomes;
Strain variant detection & resolution;
Streaming variant analysis.
Michael Crusoe, Jordan Fish & Jason Pell
65. Our software is open source
Methods that aren’t broadly available are limited in their utility!
Everything I talked about is in our github repository: http://github.com/ged-lab/khmer
…it’s not necessarily trivial to use…
…but we’re happy to help.
67. Planned work: distributed graph database server
[Architecture sketch: raw data sets feed a compute server (Galaxy? Arvados?) behind a web interface + API; a graph query layer serves data/info to public servers, "walled garden" servers, and private servers; upload/submit to NCBI and KBase; import from MG-RAST, SRA, and EBI.]
ivory.idyll.org/blog/2014-moore-ddd-talk.html
A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges that come from the underlying genome plateaus once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, the number of erroneous edges scales linearly with the number of sequence reads. Once enough reads are present to clearly distinguish the true edges (which come from the underlying genome), those true edges will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
Note that any such measure will do.
Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
The point is to enable biology; volume and velocity of data from sequencers is blocking.
Update from Jordan
Analyze data in cloud; import and export important; connect to other databases.