1. Making de novo assembly cheap & easy: standardized protocols for mRNAseq and metagenome assembly and analysis
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Jan 2014
ctb@msu.edu
2. My lab’s focus
De novo assembly and efficient/effective use of
NGS, especially for non-model organisms.
Open source software engineering.
Training and education in NGS.
3. There is quite a bit of life left to sequence & assemble
http://pacelab.colorado.edu/
4. Three problems:
1. Assembly memory & compute requirements?
2. It’s a complex process; what are good defaults?
3. Training is limited in opportunity, difficult for students, not always effective.
6. So, we want to go from raw data:
@SRR606249.17/1   (name)
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1   (quality score)
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
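To make the record layout concrete, here is a minimal Python sketch that parses the first read shown above and decodes its quality string. It assumes the common Phred+33 encoding (the slide does not state the encoding), and the names `FastqRecord`, `parse_fastq`, and `phred33` are illustrative, not part of any tool mentioned in this talk.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        name, sequence, plus, quality = lines[i:i + 4]
        if not name.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in " + name)
        yield FastqRecord(name[1:], sequence, quality)


def phred33(quality: str) -> List[int]:
    """Decode a quality string, ASSUMING Phred+33 encoding."""
    return [ord(char) - 33 for char in quality]


RAW = """\
@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
"""

records = list(parse_fastq(RAW))
scores = phred33(records[0].quality)
# The four uncalled bases (N) line up with the four lowest scores ('#').
print(records[0].name, min(scores), max(scores))
```

Under that encoding, the `#` characters decode to a score of 2, which is why they sit under the `N` bases.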
9. Shotgun sequencing & de novo assembly:
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
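The core idea, joining reads where the end of one matches the start of another, can be sketched in a few lines of Python. This is a toy illustration only: it uses error-free versions of two of the reads above, requires exact matches, and the function names `longest_overlap` and `merge` are made up for this example. Real assemblers must also cope with the sequencing errors shown in the reads.

```python
def longest_overlap(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge(left: str, right: str, min_length: int = 3) -> str:
    """Join two reads on their exact suffix/prefix overlap."""
    overlap = longest_overlap(left, right, min_length)
    if overlap == 0:
        raise ValueError("reads do not overlap")
    return left + right[overlap:]


# Error-free versions of two reads from the slide.
read_a = "It was the best of times, it was the wor"
read_b = ", it was the worst of times, it was the"

merged = merge(read_a, read_b)
print(merged)
```

Repeating this over all reads recovers the first line of the full sentence; a single wrong character in the overlapping region would break the exact match, which is why errors have to be resolved first.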
10. Why are big data sets difficult?
Need to resolve errors: the more coverage there is, the
more errors there are.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
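A small Python sketch of why errors cost memory: every sequencing error creates k-mers that appear nowhere else, and each one has to be stored. The example uses an exact `Counter` and the "witdom" error from the earlier slide; the value of `K` and the helper names are arbitrary choices for illustration, and production tools typically use more compact counting structures.

```python
from collections import Counter


def kmers(sequence: str, k: int) -> list:
    """All overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(reads: list, k: int) -> Counter:
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


K = 5  # arbitrary small k for illustration
truth = "it was the age of wisdom"

clean_reads = [truth] * 10                                 # 10x coverage, no errors
noisy_reads = [truth] * 9 + ["it was the age of witdom"]   # one read with one error

clean = count_kmers(clean_reads, K)
noisy = count_kmers(noisy_reads, K)

# k-mers that exist only because of the single error; each is seen once.
error_kmers = [kmer for kmer in noisy if kmer not in clean]
print(len(clean), len(noisy), error_kmers)
```

More coverage adds no new k-mers for the true sequence, but every additional error adds up to K new ones, so the table of distinct k-mers grows with the size of the data set.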
11. The scaling problem
We can cheaply gather DNA data in quantities
sufficient to swamp straightforward assembly
algorithms running on commodity hardware.
Since ~2008:
The field has engaged in lots of engineering
optimization…
…but the data generation rate has consistently
outstripped Moore’s Law.
18. Contig assembly now scales a lot better.
Most samples can be assembled in < 50 GB of
memory.
19. Diginorm is widely useful, becoming widely used:
1. Assembly of the H. contortus parasitic nematode
genome, a “high polymorphism/variable coverage”
problem.
(Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P.
marinus) transcriptome, a “big assembly” problem.
(in prep)
3. Osedax symbiont metagenome, a “contaminated
metagenome” problem (Goffredi et al, 2013; pmid
20. Second problem: too many choices!
What programs and options do you use??
Read trimming and filtering (x100)
Assembly (x10)
Quantification (x20)
Annotation (x20)
Science! (x 10,000)
21. Third problem: training
I teach:
Summer NGS course (two weeks, KBS); heavily
oversubscribed.
Many ad hoc workshops
Fall BEACON course (intro computational science)
Others teach:
Summer/fall workshops (Robin Buell)
Various genomics/bioinformatics courses (Shin-han
Shiu, Rob Britton, ???)
22. Overall training results:
We can fairly easily get people over the initial
“technical” hump (here are some programs,
here’s how to use them).
We can begin to teach people the way to think
about the problem.
People have a really tough time connecting
generic instruction to their own research,
however!
(And people need to learn how to analyze their own
23. Three problems:
1. Assembly memory & compute requirements?
2. It’s a complex process; what are good defaults?
3. Training is limited in opportunity, difficult for students, not always effective.
24. Solution? khmer-protocols
Effort to provide standard “cheap” assembly protocols for Illumina mRNAseq & metagenomes in the cloud.
Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
Open, versioned, forkable, citable.
Pipeline: Read cleaning, Diginorm, Assembly, Annotation, RSEM differential expression.
26. “Eel Pond” mRNAseq protocol
Adapter trim & quality filter
Diginorm to C=20
Trim high-coverage reads at low-abundance k-mers
Assemble with Trinity
Group transcripts
Annotate x database
RSEM (map QC reads to count)
EBSeq (differential expression analysis)
Extracting differentially expressed genes & graphing
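The "Diginorm to C=20" step can be sketched as follows: a read is kept only if the median abundance of its k-mers, among the reads kept so far, is still below the cutoff C. This is a simplified illustration using an exact in-memory `Counter` and a tiny k; khmer itself uses a memory-efficient probabilistic counting structure, and the function name `diginorm` and the toy reads are invented for this example.

```python
from collections import Counter
from statistics import median


def kmers(sequence: str, k: int) -> list:
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def diginorm(reads: list, k: int, cutoff: int) -> list:
    """Keep a read only while its estimated coverage is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # Median k-mer abundance is used as the read's coverage estimate.
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            counts.update(read_kmers)
            kept.append(read)
    return kept


high_coverage = "ACGTTGCAAG"  # toy read seen 50 times
low_coverage = "TTTTCCCCGG"   # toy read seen 3 times

reads = [high_coverage] * 50 + [low_coverage] * 3
kept = diginorm(reads, k=4, cutoff=20)
print(len(reads), "->", len(kept))
```

In this toy input the 50 copies of the high-coverage read are cut down to 20 while all 3 copies of the low-coverage read survive, which is the point: redundant data is discarded in a single pass without losing rare sequence.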
27. “Kalamazoo” metagenome protocol
Adapter trim & quality filter
Diginorm to C=10
Trim high-coverage reads at low-abundance k-mers
Diginorm to C=5
Too big to assemble? Partition graph; split into "groups"; reinflate groups (optional)
Small enough to assemble? Assemble!!!
Prokka
Map reads to assembly
Annotate contigs with abundances
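The "Trim high-coverage reads at low-abundance k-mers" step can be sketched like this: in a read that is otherwise well covered, a k-mer seen only once is probably an error, so the read is cut just before the base that introduces it. Reads with low overall coverage are left alone, since their rare k-mers may be real. The thresholds `min_count` and `min_coverage`, the function name, and the toy reads are all hypothetical choices for illustration, not the protocol's actual settings.

```python
from collections import Counter
from statistics import median


def kmers(sequence: str, k: int) -> list:
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def trim_read(read: str, counts: Counter, k: int,
              min_count: int = 2, min_coverage: int = 5) -> str:
    """Truncate a high-coverage read at its first low-abundance k-mer."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return read
    # Low-coverage reads are not trimmed: their rare k-mers may be genuine.
    if median(counts[kmer] for kmer in read_kmers) < min_coverage:
        return read
    for start, kmer in enumerate(read_kmers):
        if counts[kmer] < min_count:
            # Drop the last base of the offending k-mer and everything after it.
            return read[:start + k - 1]
    return read


K = 4
true_read = "ACGTTGCAAG"
error_read = "ACGTTGCTAG"  # same read with one substitution near the end
rare_read = "TTTTCCCCGG"   # genuinely low-coverage sequence

reads = [true_read] * 10 + [error_read, rare_read]
counts = Counter()
for read in reads:
    counts.update(kmers(read, K))

trimmed = [trim_read(read, counts, K) for read in (true_read, error_read, rare_read)]
print(trimmed)
```

Only the read carrying the substitution is shortened; removing those error k-mers is what keeps memory use down before assembly.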
33. What khmer-protocols is:
Starting point.
Defensible initial solution to get initial results.
Works on ~80% or more of samples, guesstimated.
Great (?) way to learn
100% reproducible; methods section on
computational analysis is more or less written for you.
Fairly fast and inexpensive (comparatively)
(~$100/data set)
34. What khmer-protocols is not:
The One True Solution.
The Best Solution.
Proprietary.
Closed.
Slow and expensive (comparatively).
35. Speed up/efficiency?
[Figure: two panels, "Walltime to complete assemblies" (total walltime, hrs) and "RAM needed to complete assemblies" (total memory used, GB), comparing DN vs. RAW reads per sample for occ oases, occ trinity, ocu oases, and ocu trinity.]
Elijah Lowe
36. Diginorm increases sensitivity (very slightly :)
Evaluation by homology against a reference gene
37 extra from diginorm, vs 17 lost;
64 extra from diginorm, vs 15 lost;
Elijah Lowe
37. Please use!
Would love feedback: what worked? What didn’t
work?
Cannot support khmer-protocols on HPC, but can
support it in the cloud; iCER may (?) support it on
HPC -- all of the software is installed.
(We are working on better default support for HPC.)
38. Links & more references
ged.msu.edu/angus/ - NGS course materials
khmer-protocols.readthedocs.org – khmer
protocols
Cloud computing discussion next Wed, 1/22, 2pm, iCER.
E-mail me at: ctb@msu.edu
Editor's Notes
Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.