1. Marker Gene Analysis
Best Practices
Susan Huse
Marine Biological Laboratory /
Brown University
October 17, 2012
2. Cleaning Data
Filtering:
Remove reads that are likely to be overall low-quality and have
errors throughout the read.
Quality Trimming:
Trim off nucleotides from the end(s) of the read based on local
quality values.
Denoising:
Adjust nucleotides that are more likely to be an error in base-calling
(noise) than a true low-frequency variation (signal).
Anchor Trimming:
Trim the end of long amplicons to a conserved location in the SSU
alignment.
Chimera Removal:
Remove hybrid sequences created during amplification.
3. Recommended 454 Filtering
• Exact match to barcode and proximal primer
• Optional denoising (currently only 454)
• Remove sequences
– with Ns
– that are too short
– Below average or window quality threshold
• Trim to distal primer or anchor
– Remove sequences without anchor / primer
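The filtering steps above can be sketched as a single function. This is a minimal illustration, not the pipeline used for these slides: the function name, the default thresholds (`min_length`, `min_avg_qual`), and the input format (a sequence string plus a list of integer quality scores) are all assumptions, and denoising and distal trimming are left out.

```python
def filter_454_read(seq, quals, barcode, primer, min_length=50, min_avg_qual=30):
    """Return the read with barcode and primer removed, or None if it fails a filter.

    Hypothetical helper: thresholds are illustrative defaults, not recommendations.
    """
    prefix = barcode + primer
    # Require an exact match to the barcode and proximal primer.
    if not seq.startswith(prefix):
        return None
    insert = seq[len(prefix):]
    insert_quals = quals[len(prefix):]
    # Remove reads that contain ambiguous bases (Ns).
    if "N" in insert:
        return None
    # Remove reads that are too short (or empty).
    if not insert or len(insert) < min_length:
        return None
    # Remove reads whose average quality is below the threshold.
    if sum(insert_quals) / len(insert_quals) < min_avg_qual:
        return None
    return insert
```

A read that passes every test comes back without its barcode and primer; any failure returns `None`, so the caller can count rejections per filter if needed.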
4. SSU rRNA
Anchor Trimming
Next-gen sequences often do not reach to the distal primer,
and reads may have a range of lengths.
De novo OTU clustering and other sequence comparisons
are more consistent if all tags are trimmed to the same
start and stop positions in the rRNA alignment.
Anchor trimming uses a highly conserved location situated
within the read length and truncates all reads to that
position. Be careful that the anchor is unique and
present across all taxa.
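A minimal sketch of anchor trimming, assuming an exact-match anchor string (real anchors are often matched with some tolerance). The function name and the example anchor used in testing are hypothetical.

```python
def anchor_trim(seq, anchor):
    """Truncate seq at the end of the anchor; return None if the anchor is unusable.

    Assumes an exact string match; a production tool would likely allow
    mismatches or use alignment coordinates instead.
    """
    # The anchor must occur exactly once (non-overlapping count),
    # otherwise the trim position is missing or ambiguous.
    if seq.count(anchor) != 1:
        return None
    end = seq.find(anchor) + len(anchor)
    return seq[:end]
```

Reads without the anchor, or with more than one copy, return `None` and would be removed, which mirrors the "remove sequences without anchor" rule in the filtering list.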
5. An Illumina HiSeq Error Distribution
[Figure: Quality Scores for Error Positions. Cumulative percent of errors (0% to 100%) vs. quality score (0 to 40), untrimmed data. Annotation: 80% of error bases have a quality score <= 16.]
Before trimming, most errors have low Q scores
6. HiSeq Reads with Ns
NTAGCACCAAACATAAATCACCTCACTTAAGTGGCTGGAGACAAATAATCTCTTTAATAACCTGATTCAGCGAAACCAATCCGCGGCATTTAGTAGCGGTA!
NTAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATG!
NGCGCCAATATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTGATGAACTAAGTCAACCTCAGCACTAACCTTGCGAGTCATTTCTTTGATTTGGTCAT!
NGTAAAAATGTCTACAGTAGAGTCAATAGCAAGGCCACGACGCAATGGAGAAAGACGGAGAGCGCCAACGGCGTCCATCTCGAAGGAGTCGCCAGCGATAA!
NTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTC!
CAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTNTNNNNNAATNNNNNNNNNNNNNNNNNNNNNNNCANNNNNTNGNNNNANNNNNTTGAGTGTGAGGT!
CGGATTGTTCAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTNAACANNNNNNNNNNNNNATAGTAATCCACGCTCTTNTAANATGTCAACAAGAG!
TATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGAATTTTACCAATGACCANNNCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAG!
TAGAAGTCGTCATTTGGCGAGAAAGCTCAGTCTCAGGAGGAAGCGGAGCAGTCCAAANNNTTTTGAGATGGCAGCAACGGAAACCATAACGAGCATCATCT!
TGCTGTTGAGTGGTCTCATGACAATAAAGTATGTCNCTGNNTTGAAGNNTNNNNNNNNNNNNNNNCTNATACAATCACGCNCANNNNNAAAAGTGTCGTGT!
CTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGNCTTANNNNNNNNNNNNTGGCGACCCTGTTTTGTATGGCANCTTGCCGCCGCGT!
CGGCAGAAGCCTGAATGAGCTTAATAGAGGCCAAAGCGGTCTGGAAACGTACGGATTNNNNAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTGAA!
GTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGNTGGTNNCNNNNNNNNNAAATTGTTTGGAGGCGGTCAAAANGCCGCCTCCGGTG!
ATATCAACCACACCAGAAGCAGCATCAGTGACGACATTAGAAATATCCTTTGNAGTNNNNNNNNTATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTG!
In this dataset:
• 68 reads contained at least 1 N, of these:
• 14 (21%) could not be mapped to PhiX,
• 7 of those 14 (50%) had only 1 N
• 24 (35%) contain more than 1 N
Illumina
7. Minoche Filtering for Illumina
Table 2: Expected error rates based on Q-scores
(% of bases lost), by filter:
• No filter
• Illumina Chastity (ChF)
• Low-Quality (B) tails
• Ns
• <1/3 of nt Q<30 in 1st half
• avgQ < 30 1st 30% of nt
• All filters
Minoche A, et al. 2011. Genome Biology 12: R112,
using Beta vulgaris, Arabidopsis thaliana, and PhiX
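As a rough illustration of the quality-based rows in this list, the sketch below converts a Phred score to an error probability (P = 10^(-Q/10)) and applies three of the filters. The function names are hypothetical, and the exact pass/fail direction of each rule is an assumption read off the slide's row labels (reads are kept when they have no Ns, fewer than one third of first-half bases below Q30, and an average of at least Q30 over the first 30% of bases); consult the cited paper for the authoritative definitions.

```python
def phred_error_probability(q):
    """Expected error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

def passes_quality_filters(seq, quals):
    """Return True if the read passes the three illustrative filters."""
    # Filter: reads containing Ns.
    if "N" in seq:
        return False
    # Filter: too many low-quality bases in the first half
    # (assumed rule: keep only if fewer than 1/3 are below Q30).
    half = quals[: len(quals) // 2]
    low = sum(1 for q in half if q < 30)
    if low >= len(half) / 3:
        return False
    # Filter: average quality below 30 over the first 30% of bases.
    first = quals[: max(1, int(len(quals) * 0.3))]
    if sum(first) / len(first) < 30:
        return False
    return True
```

For scale, a quality score of 16 corresponds to an error probability of about 2.5%, and a score of 30 to 0.1%.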
8. Remaining Errors
[Figure: Quality Scores for Error Positions. Percent of errors (0% to 100%) vs. quality score (0 to 40), for trimmed and untrimmed data. Annotation on the remaining errors: "PCR errors?"]
9. QIIME Illumina Pipeline
• Single mismatch to barcode
• Trim read to last position above quality threshold q
• Remove sequences less than length threshold p
• Remove sequences with more than n Ns
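The four steps listed above can be sketched as follows. This is a literal reading of the slide, not QIIME's actual implementation or API: the function name, argument layout, and default values for `q`, `p`, and `n` are assumptions made for illustration.

```python
def filter_illumina_read(seq, quals, barcode, known_barcodes, q=20, p=75, n=0):
    """Return the trimmed read, or None if it fails a step (illustrative sketch)."""
    def mismatches(a, b):
        return sum(x != y for x, y in zip(a, b))

    # Step 1: allow at most a single mismatch to a known barcode.
    if not any(len(barcode) == len(k) and mismatches(barcode, k) <= 1
               for k in known_barcodes):
        return None
    # Step 2: trim the read to the last position with quality above q.
    keep = 0
    for i, qual in enumerate(quals):
        if qual > q:
            keep = i + 1
    trimmed = seq[:keep]
    # Step 3: remove sequences shorter than the length threshold p.
    if len(trimmed) < p:
        return None
    # Step 4: remove sequences with more than n Ns.
    if trimmed.count("N") > n:
        return None
    return trimmed
```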
10. Paired-End Filtering
A small insert size allows for sequence overlap
Read 1 (forward)
Area of sequence overlap
Read 2 (reverse)
Keep only reads that match exactly throughout the region of
overlap.
Amplicons designed to completely overlap (e.g., V6) ensure
the highest quality sequences.
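A minimal sketch of the exact-overlap test, assuming the overlap length is already known (a real merger would search for it). The function names are hypothetical. With a fully overlapping design such as V6, `overlap` simply equals the read length.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement a DNA string containing A, C, G, T, or N."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_with_overlap(read1, read2, overlap):
    """Merge a read pair, keeping it only if the overlap matches exactly.

    read2 is given as sequenced (reverse strand); `overlap` is the assumed
    number of overlapping bases. Returns None when the pair disagrees.
    """
    rc2 = reverse_complement(read2)
    if overlap <= 0 or overlap > min(len(read1), len(rc2)):
        return None
    # Keep only pairs that agree at every position in the overlap.
    if read1[-overlap:] != rc2[:overlap]:
        return None
    return read1 + rc2[overlap:]
```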
11. But Variation Still Exists
E. coli K-12 V6 paired end with complete perfect overlap
[Figure: sequence logo of the aligned reads, positions 1 to 59, 5′ to 3′ (weblogo.berkeley.edu)]
Is this:
1. systematic bidirectional sequencing error (unlikely)
2. PCR error, or
3. natural variation?
13. [Diagram: chimera formation during PCR, strands drawn 5′ to 3′]
PCR primer anneals to complementary target
Extension creates double-stranded amplicon
But…
Premature dissociation terminates elongation
The incomplete strand binds to a different template at a conserved
region…
…then extends to create a chimera
The chimera can act as a template during the next PCR round.
14. Chimera Detection
1. Look for the best match to the left (left parent)
Parent A
Chimeric Read
2. Look for the best match to the right (right parent)
Chimeric Read
Parent B
3. Compare the distance between the two parents: are they
really different, or multiple entries for the same organism?
Parent A
Parent B
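The three steps can be sketched as below, under strong simplifying assumptions: the query and all candidate parents are pre-aligned to the same length, the breakpoint is supplied rather than searched for, and distance is a plain mismatch count. The function names and the `min_parent_diffs` threshold are hypothetical; this is not how UChime or any other tool scores chimeras.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def check_chimera(query, references, breakpoint, min_parent_diffs=3):
    """Return (left parent, right parent, is_chimera) for one query.

    `references` maps a name to a sequence aligned to the query.
    """
    left, right = query[:breakpoint], query[breakpoint:]
    # Step 1: best match to the left segment (left parent).
    parent_a = min(references,
                   key=lambda name: hamming(left, references[name][:breakpoint]))
    # Step 2: best match to the right segment (right parent).
    parent_b = min(references,
                   key=lambda name: hamming(right, references[name][breakpoint:]))
    # Step 3: are the two parents really different from each other?
    parent_distance = hamming(references[parent_a], references[parent_b])
    is_chimera = parent_a != parent_b and parent_distance >= min_parent_diffs
    return parent_a, parent_b, is_chimera
```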
15. Detection methods
differ by source of parents
1. Reference Comparison:
check against known reference sequences
2. De novo detection:
check all triplets in your amplification
16. Reference Comparison
only as good as the Ref Set
• Can only find parents if they are in the RefSet
• Any chimeras in the Ref Set are deleterious!
• Sparse RefSet may not detect chimeras from closely
related organisms (intra-genera, intra-species)
• Differential density of the Ref Set can create biases
• Poor matches to the Ref Set can be mistaken for
chimeras
• Hard to detect if parents are similar, but may not matter
17. De Novo Pros and Cons
• Can detect parents not in the RefSet: novel, close
neighbors, PCR errors, unexpected amplifications
• Must be run by amplification, i.e., by tube:
all your parents, but only your parents
• Abundance profile can be tricky with a long tail
• Early false positives (parent is lost to the RefSet)
and false negatives (chimera is added to the RefSet)
will affect downstream calls
We use both de novo and ref
18. Rates of Chimera Formation in BPC Datasets
[Figure: percent of datasets (0% to 70%) vs. percent of reads that are chimeric (0% to 50%), as a function of total reads (not unique sequences), across various datasets; series: V6V4, V3V5.]
19. Chimera detection programs
optimized for short reads
• UChime (in USearch, QIIME and VAMPS)
• Perseus (in AmpliconNoise and mothur)
20. Aggregating
Downstream analytical techniques that compensate for
inaccuracies in the remaining sequence data.
Taxonomic assignments will generally remain the same
despite a few mismatches. More so at coarser
taxonomic levels (class vs. genus)
OTU Clustering can round out small percentages of
errors depending on the algorithm used. Clustering at
3% can (but does not always!) aggregate sequences
with 1 – 2% errors.
“Aggregating” is not accepted terminology in the field
21. Taxonomic Filtering
In addition to the knowledge base associated with taxonomic
names:
• Can filter many unintended PCR amplification products.
• Reads too far from the tree can be classified as
“Unknown” and examined further.
• Important to map reads to all domains, not just Bacteria:
primers can amplify across domains and organelles.
22. Amplification of other Domains
SSU region   Total Reads   Archaea   Bacteria   Organelle   Unknown
V6           529,359       0.02%     96%        4%          0.1%
V6-V4        3,437,855     0.3%      87%        8%          4%
Samples from Little Sippewissett Marsh.
Organelles include mitochondria and chloroplasts
23. Non SSU rRNA Amplification
[Figure: non-SSU amplification products, labeled:]
• Conserved inner membrane protein
• Cardiolipin synthase
• DNA binding transcriptional dual regulator, tyrosine-binding
• Predicted antibiotic transporter
• Putative transport system permease protein
• Predicted major pilin subunit
• 16S rRNA (two labels)
Thank you, Hilary
24. Taxonomy
GAST:
Global Alignment for Sequence Taxonomy
Uses sequence alignment to compare against a RefSet
Distance = alignment distance to nearest RefSet sequence
(SILVA, Greengenes, Stajich Refs, UNITE, HOMD, etc.)
(VAMPS)
RDP:
Ribosomal Database Project
Uses k-mer matching to find nearest genus
Bootstrap values reflect confidence in the assignment
(RDP Training set, Greengenes, etc.)
(QIIME, VAMPS)
25. Sources of Error
in Taxonomic Analyses
• Primer bias
• Chimeras
• Discovery of novel 16S
• Unrepresented in reference database
• Low-quality references
• Taxonomy not available
• Incorrect taxonomy in RefSet
• Ambiguous hypervariable sequence (>1 hit)
• RefSets often biased toward most studied
26. Creating OTUs:
Operational Taxonomic Units
for taxonomy independent analyses
27. OTUs vs Taxonomy
• Novel organisms
• Many unnamed organisms
• Some clades only defined to phyla or class
• Many species names based on phenotype rather than
genotype
• Do not lump together all 16S “unknowns” or diverse,
partially classified sequences.
28. Clustering Algorithms
Different clustering algorithms
can have very different effects on the
size and number of OTUs created…
29. Clustering Methods
De novo (open)
• greedy clusters - test sequentially and incorporate
sequence into first qualifying OTU. Dependent on input
order.
• average linkage - the average distance from a sequence
to every other sequence in the OTU is less than the
width. Dependent on input order.
[complete and single linkage are other methods]
Reference (closed)
• greedy - map each sequence to representative
sequences defining prebuilt clusters
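A minimal sketch of de novo greedy clustering, assuming pre-aligned, equal-length sequences and a simple mismatch fraction as the distance. Function names and the seed-based membership rule are illustrative choices, not a description of any specific tool.

```python
def distance(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_cluster(sequences, width=0.03):
    """Assign each sequence, in input order, to the first qualifying OTU.

    Each cluster is a list whose first member is the seed.
    """
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            # Incorporate into the first OTU whose seed is within the width.
            if distance(seq, cluster[0]) <= width:
                cluster.append(seq)
                break
        else:
            # No qualifying OTU: this sequence seeds a new one.
            clusters.append([seq])
    return clusters
```

Running the same three sequences in two different orders shows the order dependence: with `a = "AAAAAAAAAA"`, `b = "AAAAAAAAAC"`, `c = "AAAAAAAACC"` and `width=0.1`, the order `[a, b, c]` yields two OTUs while `[b, a, c]` yields one.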
30. The Problem of OTU Inflation
De novo clustering algorithms return more OTUs than
predicted for mock communities.
OTU inflation leads to:
• alpha diversity inflation
• beta diversity inflation
Where does this inflation come from?
• residual sequencing errors,
• chimeras,
• multiple sequence alignments,
• clustering algorithms
31. Rarefaction, Sample Size
under OTU Inflation
[Figure: rarefaction curves for M2FN PML MS-CL - PML. OTUs (0 to 7,000) vs. number of sequences sampled (0 to 120,000); curves labeled 5K, 10K, 15K, 20K, 50K, 100K.]
33. Cluster to Reference
1. Create a comprehensive set of Cluster
Representatives (e.g., new Greengenes) representing
the breadth of Bacteria
2. Assign each sequence to ClusterRep <= W
3. If Seq is not a member of any cluster, set aside
4. Cluster denovo the set of extra-cluster sequences
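Steps 2 to 4 can be sketched as follows, assuming pre-aligned, equal-length, unique sequences and a mismatch-fraction distance. `cluster_reps` (a mapping from cluster ID to representative sequence) and the function names are hypothetical, and the de novo step here uses a simple greedy pass as a stand-in for whichever de novo method is preferred.

```python
def distance(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster_to_reference(sequences, cluster_reps, width=0.03):
    """Return (assignments, de novo clusters) for a list of unique sequences."""
    assignments = {}   # sequence -> ClusterRep ID
    set_aside = []
    for seq in sequences:
        # Step 2: assign to the nearest ClusterRep if it is within W.
        best_id = min(cluster_reps,
                      key=lambda rep_id: distance(seq, cluster_reps[rep_id]))
        if distance(seq, cluster_reps[best_id]) <= width:
            assignments[seq] = best_id
        else:
            # Step 3: not a member of any cluster, set aside.
            set_aside.append(seq)
    # Step 4: cluster the extra-cluster sequences de novo (greedy, seed-based).
    denovo = []
    for seq in set_aside:
        for cluster in denovo:
            if distance(seq, cluster[0]) <= width:
                cluster.append(seq)
                break
        else:
            denovo.append([seq])
    return assignments, denovo
```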
34. Advantages of
clustering to full-length reference
• Not as prone to OTU inflation
• Can add new data as available
• Provides static Cluster IDs
– Can be used to compare short reads from
different regions (V3-V5 and V6)
– Can compare with other projects using same Ref
Set
35. Oligotyping
• Further differentiation within closely related organisms
(e.g., genus)
• Rather than blanket 3% clustering, select sequence
positions with the most information (Shannon Entropy)
[Figure: Fusobacterium oligotypes across oral sites: supragingival plaque, subgingival plaque, keratinized gingiva, buccal mucosa, hard palate, tongue dorsum, tonsils, saliva, throat.]
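Selecting positions by Shannon entropy can be sketched as below. This is only the column-ranking idea, assuming aligned reads of equal length; the function names are hypothetical and the full oligotyping procedure involves more than this step.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def most_informative_positions(aligned_reads, top=3):
    """Indices of the `top` highest-entropy columns, highest first."""
    entropies = [column_entropy(col) for col in zip(*aligned_reads)]
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return ranked[:top]
```

A column where every read carries the same base has entropy 0 and carries no information for separating oligotypes; a column split evenly between two bases has entropy 1 bit.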
36. “But I’m not interested in the
rare biosphere,
only the major players.
Can’t I just remove the low
abundance OTUs?”
37. [Figure: rank-abundance curves (count in OTU vs. OTU rank). Annotations: "A small number of highly abundant organisms"; "A large number of low abundance organisms" (Rare Biosphere).]
Consistent community profile
across samples and environments
Sogin et al, 2006. Microbial diversity in the deep sea and the underexplored “rare biosphere” PNAS 103: 12115-12120
38. Distribution of OTU relative abundances
across 210 HMP stool samples
Huse et al. (2012) PLoS ONE
39. Distribution of OTU Absolute Abundances
in English Channel Water Samples
[Figure: number of OTUs by frequency in PML samples; bins: Absent, Singleton, Doubleton, 3-5, 6-10, 11-50, 51-500, >500.]
40. Everything may not be everywhere,
but everything is rare somewhere!
If you feel you must remove low abundance OTUs,
don’t do it until you have clustered
ALL of your samples
41. Alpha and Beta Diversity:
Impacts of Sampling Depth
and Diversity Algorithm
42. Alpha Diversity - Richness
[Figure: richness estimates (0 to 1,800) over an x axis of 0 to 40,000; series: CL - ACE, SLP - ACE, CL - Chao, SLP - Chao; also marked: 1 in 5000, 1 in 2500, 1 in 1000, 1 in 500.]
Alpha diversity metrics are sensitive to cluster
method, sequencing depth and rare OTUs
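The sensitivity to rare OTUs is easy to see in the Chao1 estimator, which depends only on the observed OTU count and the numbers of singletons and doubletons. The sketch below uses the bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)); the function name and input format (a list of per-OTU read counts) are assumptions.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from per-OTU read counts."""
    observed = sum(1 for c in otu_counts if c > 0)
    singletons = sum(1 for c in otu_counts if c == 1)
    doubletons = sum(1 for c in otu_counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
```

For example, counts of `[10, 5, 1, 1, 1, 2]` give an estimate of 7.5; adding just two more spurious singletons raises it to 13.0, which is why residual errors and clustering choices that create singleton OTUs inflate richness estimates.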
44. Comparing Different
Sampling Depths
The “population” is a set of 50,000 reads from one sample
The “samples” are randomly-selected subsets of sizes:
1,000 15,000
5,000 20,000
7,500 25,000
10,000
Calculate diversity estimates for each subsample; all subsample
depths represent the same population.
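A minimal sketch of this subsampling design, assuming the population is a list with one OTU label per read. The function names and the use of observed OTU count as the diversity estimate are illustrative; the depths are the ones listed above.

```python
import random

DEPTHS = [1000, 5000, 7500, 10000, 15000, 20000, 25000]

def subsample(reads, depth, seed=0):
    """Randomly select `depth` reads without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(reads, depth)

def richness_by_depth(reads, depths=DEPTHS, seed=0):
    """Observed OTU count for one random subsample at each depth."""
    return {depth: len(set(subsample(reads, depth, seed))) for depth in depths}
```

Repeating the call with different seeds gives replicate subsamples at each depth, which is what the distance comparisons on the following slides are built from.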
45. Community Distance of Subsamples
[Figure: community distance (0 to 0.12) across replicates; series: Bray-Curtis (1K), Bray-Curtis (5K), Morisita-Horn (1K), Morisita-Horn (5K).]
Subsample 1,000 and 5,000 reads from sample of 50,000 reads,
Pairwise distances for replicates at single depth
46. Effect of Sample Depth - Bray Curtis
[Figure: Bray-Curtis distance (0 to 1.0) between subsamples at depths 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000. Annotation: nearly 100% different.]
Bray-Curtis uses absolute counts, so
intra-community distances are high as depths diverge.
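The depth effect follows directly from the formula, sum(|a_i - b_i|) / sum(a_i + b_i), when it is applied to raw counts. A minimal sketch, with a hypothetical function name and made-up counts for illustration:

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two vectors of absolute OTU counts."""
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    denominator = sum(counts_a) + sum(counts_b)
    return numerator / denominator

# Same community proportions (60/30/10) at depths of 1,000 and 25,000 reads.
shallow = [600, 300, 100]
deep = [15000, 7500, 2500]
distance = bray_curtis(shallow, deep)  # 24000 / 26000, about 0.92
```

Even though both vectors describe an identical community, the distance is about 0.92 purely because the depths differ.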
47. Effect of Sample Depth - Morisita Horn
[Figure: Morisita-Horn distance (0 to 0.009) between subsamples at depths 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000. Annotation: nearly 0.5% different.]
Morisita-Horn is a beta diversity metric that uses relative abundances and
compensates for different sample sizes.
Distances are low across depths above the minimum sampling depth.
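A minimal sketch of the Morisita-Horn calculation, written as a distance (1 minus the similarity index). The function name and the example counts are hypothetical.

```python
def morisita_horn_distance(counts_a, counts_b):
    """1 - Morisita-Horn similarity for two vectors of OTU counts."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    cross = sum(a * b for a, b in zip(counts_a, counts_b))
    # Simpson-type dominance terms, normalized by each sample's own depth.
    d_a = sum(a * a for a in counts_a) / total_a ** 2
    d_b = sum(b * b for b in counts_b) / total_b ** 2
    similarity = 2 * cross / ((d_a + d_b) * total_a * total_b)
    return 1 - similarity

# Same community proportions (60/30/10) at depths of 1,000 and 25,000 reads.
shallow = [600, 300, 100]
deep = [15000, 7500, 2500]
distance = morisita_horn_distance(shallow, deep)  # effectively 0
```

Because each sample is normalized by its own total, two samples with the same proportions have a distance of essentially zero regardless of depth; in real subsamples the small remaining distances come from sampling noise, mostly among rare OTUs.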
48. SLP Clustering and Bray-Curtis
[Figure: PCoA, PC 1 (-0.4 to 0.8) vs. PC 2 (-0.3 to 0.4); points labeled by depth: 1,000; 2,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000; 30,000; 40,000.]
Bray-Curtis PCoA clusters entirely on depth
(each point represents 10 atop one another)
49. SLP Clustering with Morisita Horn
[Figure: PCoA (PC 1, PC 2), axis ranges roughly -0.015 to 0.01; points labeled by depth: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000.]
Minimum sample depth here of 10,000,
but will be a function of the diversity of the sample
50. Acknowledgements
The Josephine Bay Paul Center
for Comparative Molecular Biology and Evolution
Mitch Sogin
Andy Voorhis
Anna Shipunova
David Mark Welch
A. Murat Eren
Hilary Morrison
Joe Vineis
Sharon Grim
51. Why filter infrequent errors?
Ns          Average 454 Error Rate   Errors / 400nt   Percent of Reads
0 or more   0.40%                    1.6              100%
0           0.40%                    1.6              99.3%
If we include all reads, with or without Ns,
we have an overall error rate of 0.4%.
If, however, we remove the <1% of sequences with Ns,
we still have an overall error rate of 0.4%.
Why bother??
454
52. Why filter infrequent errors?
Ns   Average Error Rate   Errors / 400nt   Percent of Reads
0    0.40%                1.6              99.3%
1    1.11%                3.1              0.57%
2    3.81%                8.7              0.1%
3    7.26%                16.5             0.0%
4    8.40%                19.2             0.0%
5    10.46%               25.1             0.0%
It’s not just improving the overall error rate,
but removing spurious data
Low-quality reads can be interpreted as unique organisms:
0.7% of 500,000 reads = 3,500 “unique organisms”
53. 454 Error Distribution
Distribution of errors in short reads (<100nt)
Most reads contain
no errors at all
454 errors are not evenly distributed among reads:
many reads have only a small number of errors, and
a small number of reads have many errors.
54. A good beginning
can mask a bad end
For a 450 nt read whose high-quality bases average 35:
if the last 50 have an average of 0
avg qual = ((400*35) + (50*0)) / 450
= 31
if the last 100 have an average of 25
avg qual = ((350*35) + (100*25)) / 450
= 33
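The arithmetic can be checked with a length-weighted average over segments of a 450 nt read, and contrasted with a sliding-window minimum, which does expose the bad end. Both function names and the window size of 50 are illustrative assumptions.

```python
def average_quality(segments):
    """Length-weighted mean quality for a list of (length, mean_quality) pairs."""
    total_length = sum(length for length, _ in segments)
    return sum(length * q for length, q in segments) / total_length

def min_window_quality(quals, window=50):
    """Lowest mean quality over any window of `window` consecutive bases."""
    return min(sum(quals[i:i + window]) / window
               for i in range(len(quals) - window + 1))

# 400 nt at Q35 followed by 50 nt at Q0: the overall average still passes Q30.
overall = average_quality([(400, 35), (50, 0)])          # about 31.1
worst_window = min_window_quality([35] * 400 + [0] * 50)  # 0.0
```

With 350 nt at Q35 and 100 nt at Q25 the overall average is about 32.8, again above 30, which is why an average-quality filter alone can miss a poor 3′ end.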
56. 454 Filter Summary
Filter               Percent of Reads   Average Error Rate   Average Errors / 400 nt
N=0                  99%                0.40%                1.6
N>=1                 1%                 0.91%                3.6
Exact Primer         95%                0.38%                1.5
Not Exact Primer     5%                 0.84%                3.4
Average Qual >=30    98%                0.90%                3.6
Average Qual <30     2%                 1.3%                 5.2
58. Evaluating Chimeras (USearch)
Parent A
Query
Parent B
Diffs:
A,B: Q matches expected P
a,b: Q matches other P
p: A=B!=Q
Votes: + for Model, 0 neutral, ! against Model
Model: shows extent of Parent A and Parent B; xxxx is the overlap matching both A and B
59. Initial Length: 277
Extent of your
sequence
Click on the bar to
see the alignment
Extent of your
match
60. Check for left and right parents:
BLAST the left (1-175)
BLAST the right (175 - 277)
61. 100% Match to Fusobacterium (positions 1 to 175)
100% Match to Pseudomonas (positions 175 to 277)
62. Taxonomic Names
• Bergey’s Taxonomic Outline – manual of
taxonomic names for bacteria
• List of Prokaryotic names with Standing in
Nomenclature (vetting process)
• NCBI – similar taxonomy, but multiple
“subs” (subclass, suborder, subfamily, tribe)
• Archaea – a work in progress…
• Fungi – another work in progress…
63. Cluster “Width”
Diameter (CL): sequences are never more than D apart.
Radius (SL, AL, Gr): sequences are never more than R from the seed.
64. Average Linkage
collapses errors
Cluster Count: 1
#1
Clusters tend to be heavily dominated by their most abundant sequence,
which strongly weights the average and smooths the noise.
65. Still lose outlier
sequencing errors
Multiple sequencing errors still not clustered
66. Inflation in Action:
Multiple Sequence Alignment
and Complete Linkage clustering
1,042 is a few more
than the expected 2
67. Example MSA
Regardless of clustering algorithm,
an MSA cannot fully align tags whose
sequences are too divergent
18,156 sequences and 392 positions
68. Relative Inflation
Absolute number of errant OTUs will increase with sample size.
Relative number of errant OTUs will decrease with sample complexity.
69. The
Magical 3%
3% SSU OTUs = Species
and
6% SSU OTUs = Genera
NOT!
70. Clustering Questions
• How meaningful are clusters functionally?
• When is a rare sequence truly rare and when is it an error?
• Should it be included in an existing cluster or start its
own?
• How to place sequences if OTUs overlap?
• What is the effect of residual low quality data or
chimeras?
• How sensitive are alpha and beta diversity estimates to
clustering results?