Marker Gene Analysis: Best Practices

Susan Huse
Marine Biological Laboratory /
Brown University
October 17, 2012

Cleaning Data
Filtering:
Remove reads that are likely to be overall low-quality and have
errors throughout the read.

Quality Trimming:
Trim nucleotides from the end(s) of the read based on local
quality values.

Denoising:
Adjust nucleotides that are more likely to be an error in base-calling
(noise) than a true low-frequency variation (signal).

Anchor Trimming:
Trim the end of long amplicons to a conserved location in the SSU
alignment.

Chimera Removal:
Remove hybrid sequences created during amplification.

Recommended 454 Filtering
•  Exact match to barcode and proximal primer
•  Optional denoising (currently only 454)
•  Remove sequences
–  with Ns
–  that are too short
–  Below average or window quality threshold
•  Trim to distal primer or anchor
–  Remove sequences without anchor / primer
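
The checks in this list can be sketched as a single pass/fail function. This is a minimal illustration, not the pipeline used by the author: the function name, the argument layout (barcode immediately followed by the proximal primer at the start of the read), and the default thresholds are all assumptions made for the example.

```python
def passes_454_filter(seq, quals, barcode, primer,
                      min_len=50, min_avg_qual=30):
    """Return True if a read passes the checks listed above.

    seq: read as a string; quals: per-base Phred scores, same length.
    min_len and min_avg_qual are illustrative values, not recommendations.
    """
    # Exact match to the barcode followed by the proximal primer
    if not seq.startswith(barcode + primer):
        return False
    # Remove sequences with Ns
    if "N" in seq.upper():
        return False
    # Remove sequences that are too short
    if len(seq) < min_len:
        return False
    # Remove sequences below the average quality threshold
    if not quals or sum(quals) / len(quals) < min_avg_qual:
        return False
    return True
```

Trimming to the distal primer or anchor would follow as a separate step on the reads that pass.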

SSU rRNA
Anchor Trimming
Next-gen sequences often do not reach to the distal primer,
and reads may have a range of lengths.

De novo OTU clustering and other sequence comparisons
are more consistent if all tags are trimmed to the same
start and stop positions in the rRNA alignment.

Anchor trimming uses a highly conserved location situated
within the read length and truncates all reads to that
position. Be careful that the anchor is unique and
present across all taxa.
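
A minimal sketch of anchor trimming, assuming an exact string match to the anchor. Real anchors are usually matched with some tolerance for degenerate bases, and whether the anchor itself is kept or cut off is a design choice; here it is kept. The function name is hypothetical.

```python
def anchor_trim(seq, anchor):
    """Truncate seq at the end of the anchor.

    Returns None when the anchor is absent or occurs more than once,
    since a missing or non-unique anchor cannot give a consistent stop
    position.
    """
    if seq.count(anchor) != 1:
        return None
    return seq[: seq.index(anchor) + len(anchor)]
```

Reads that return None correspond to "remove sequences without anchor" in the filtering list.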

An Illumina HiSeq Error Distribution
[Figure: Quality Scores for Error Positions, untrimmed data. x-axis: Quality Score (0 to 40); y-axis: Cumulative Percent of Errors (0% to 100%). Annotation: 80% of error bases have a quality score <=16.]
Before trimming, most errors have low Q scores
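
For reference, a Phred quality score Q corresponds to an expected error probability of 10^(-Q/10). The short sketch below just evaluates that standard definition for a few scores; the function name is chosen for the example.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

for q in (10, 16, 20, 30, 40):
    print(f"Q{q}: expected error probability {phred_to_error_prob(q):.4f}")
```

A score of 16 corresponds to an expected error rate of roughly 2.5% per base, while Q30 corresponds to 0.1%.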

HiSeq Reads with Ns
NTAGCACCAAACATAAATCACCTCACTTAAGTGGCTGGAGACAAATAATCTCTTTAATAACCTGATTCAGCGAAACCAATCCGCGGCATTTAGTAGCGGTA
NTAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATG
NGCGCCAATATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTGATGAACTAAGTCAACCTCAGCACTAACCTTGCGAGTCATTTCTTTGATTTGGTCAT
NGTAAAAATGTCTACAGTAGAGTCAATAGCAAGGCCACGACGCAATGGAGAAAGACGGAGAGCGCCAACGGCGTCCATCTCGAAGGAGTCGCCAGCGATAA
NTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTC
CAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTNTNNNNNAATNNNNNNNNNNNNNNNNNNNNNNNCANNNNNTNGNNNNANNNNNTTGAGTGTGAGGT
CGGATTGTTCAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTNAACANNNNNNNNNNNNNATAGTAATCCACGCTCTTNTAANATGTCAACAAGAG
TATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGAATTTTACCAATGACCANNNCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAG
TAGAAGTCGTCATTTGGCGAGAAAGCTCAGTCTCAGGAGGAAGCGGAGCAGTCCAAANNNTTTTGAGATGGCAGCAACGGAAACCATAACGAGCATCATCT
TGCTGTTGAGTGGTCTCATGACAATAAAGTATGTCNCTGNNTTGAAGNNTNNNNNNNNNNNNNNNCTNATACAATCACGCNCANNNNNAAAAGTGTCGTGT
CTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGNCTTANNNNNNNNNNNNTGGCGACCCTGTTTTGTATGGCANCTTGCCGCCGCGT
CGGCAGAAGCCTGAATGAGCTTAATAGAGGCCAAAGCGGTCTGGAAACGTACGGATTNNNNAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTGAA
GTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGNTGGTNNCNNNNNNNNNAAATTGTTTGGAGGCGGTCAAAANGCCGCCTCCGGTG
ATATCAACCACACCAGAAGCAGCATCAGTGACGACATTAGAAATATCCTTTGNAGTNNNNNNNNTATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTG

In this dataset:
•  68 reads contained at least 1 N. Of these:
–  14 (21%) could not be mapped to PhiX; 7 of those 14 (50%) had only 1 N
–  24 (35%) contained more than 1 N
Illumina

Minoche Filtering for Illumina
Table 2: Expected error rates based on Q-scores
(% of bases lost)

Filters compared:
•  No filter
•  Illumina Chastity (ChF)
•  Low-Quality (B) tails
•  Ns
•  <1/3 of nt Q<30 in 1st half
•  avgQ < 30 1st 30% of nt
•  All filters

Minoche A, et al. 2011. Genome Biology 12: R112
using Beta vulgaris, Arabidopsis thaliana, and PhiX

Remaining Errors
[Figure: Quality Scores for Error Positions, trimmed vs. untrimmed data. x-axis: Quality Score (0 to 40); y-axis: Pct of Errors (0% to 100%). Annotation: PCR errors?]

QIIME Illumina Pipeline

•  Single mismatch to barcode

•  Trim read to last position above quality threshold q

•  Remove sequences less than length threshold p

•  Remove sequences with more than n Ns
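
The four steps can be sketched in a few lines. This is an illustration of the steps as listed on this slide, not QIIME's own code; the function names, the argument layout, and the default values of q, p and n are assumptions for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def filter_illumina_read(seq, quals, read_barcode, sample_barcodes,
                         q=20, p=75, n=0):
    """Return (sample_barcode, trimmed_seq), or None if the read is rejected.

    q, p and n mirror the thresholds named above; the defaults are
    placeholders chosen for illustration only.
    """
    # Allow at most a single mismatch to a known barcode
    hits = [b for b in sample_barcodes
            if len(b) == len(read_barcode) and hamming(b, read_barcode) <= 1]
    if len(hits) != 1:
        return None
    # Trim to the last position whose quality is above q
    last_good = max((i for i, s in enumerate(quals) if s > q), default=-1)
    trimmed = seq[: last_good + 1]
    # Remove reads shorter than the length threshold p
    if len(trimmed) < p:
        return None
    # Remove reads with more than n Ns
    if trimmed.upper().count("N") > n:
        return None
    return hits[0], trimmed
```

Rejecting reads whose barcode is within one mismatch of two different samples is a conservative choice made here; it only matters when the barcode set itself is poorly separated.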

Paired-End Filtering
A small insert size allows for sequence overlap

Read 1 (forward)
Area of sequence overlap
Read 2 (reverse)

Keep only reads that match exactly throughout the region of
overlap.
Amplicons designed to completely overlap (e.g., V6) ensure
the highest quality sequences.
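
A minimal sketch of the exact-overlap rule, assuming read 2 is the reverse complement of the far end of the amplicon and that no mismatches are tolerated anywhere in the overlap. The function names and the min_overlap default are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_if_exact_overlap(read1, read2, min_overlap=10):
    """Merge a forward and a reverse read when they agree exactly.

    The 3' end of read1 must match the start of the reverse-complemented
    read2 over at least min_overlap bases. Returns the merged sequence,
    or None if no exact overlap of that length exists.
    """
    r2 = reverse_complement(read2)
    # Try the longest possible overlap first
    for k in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-k:] == r2[:k]:
            return read1 + r2[k:]
    return None
```

For a completely overlapping design such as V6, the overlap covers the whole read, so every position is effectively sequenced twice.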

But Variation Still Exists
E. coli K-12 V6 paired end with complete perfect overlap

[Figure: sequence logo of the overlapping region, positions 1 to 59, 5′ to 3′. Source: weblogo.berkeley.edu]

Is this:
1.  systematic bidirectional sequencing error (unlikely)
2.  PCR error, or
3.  natural variation?

What are Chimeras
and
How do we find them?

1.  The PCR primer anneals to its complementary target.
2.  Extension creates a double-stranded amplicon.

But…

3.  Premature dissociation terminates elongation.
4.  The incomplete strand binds to a different template at a conserved
region…
5.  …then extends to create a chimera.
6.  The chimera can act as a template during the next PCR round.

Chimera Detection
1.  Look for the best match to the left (left parent)
Parent A
Chimeric Read

2.  Look for the best match to the right (right parent)
Chimeric Read
Parent B

3.  Compare the distance between the two parents: are they
really different, or multiple entries for the same organism?
Parent A
Parent B
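
The three steps can be illustrated with a toy two-parent search. This is a sketch only: it assumes all sequences are pre-aligned and of equal length, scores by simple mismatch counts, and uses a hypothetical min_improvement threshold. Production tools such as UChime and Perseus use alignments, abundance information, and more careful scoring.

```python
def mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_two_parent_model(query, candidates, min_improvement=3):
    """Toy chimera check on pre-aligned, equal-length sequences.

    candidates: dict of name -> sequence (the possible parents).
    Returns (left_parent, right_parent, breakpoint) when a two-parent
    model explains the query better than any single parent by at least
    min_improvement mismatches; otherwise None.
    """
    best_single = min(mismatches(query, s) for s in candidates.values())
    best = None
    for bp in range(1, len(query)):
        # Step 1: best match to the left of the breakpoint
        left = min(candidates,
                   key=lambda n: mismatches(query[:bp], candidates[n][:bp]))
        # Step 2: best match to the right of the breakpoint
        right = min(candidates,
                    key=lambda n: mismatches(query[bp:], candidates[n][bp:]))
        if left == right:
            continue
        score = (mismatches(query[:bp], candidates[left][:bp])
                 + mismatches(query[bp:], candidates[right][bp:]))
        if best is None or score < best[0]:
            best = (score, left, right, bp)
    if best is None or best_single - best[0] < min_improvement:
        return None
    # Step 3: the two parents must really differ from each other
    if mismatches(candidates[best[1]], candidates[best[2]]) < min_improvement:
        return None
    return best[1], best[2], best[3]
```

Where the candidate parents come from (a reference set or the sample itself) is the distinction drawn in the next slides.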

Detection methods
differ by source of parents

1.  Reference Comparison:
check against known reference sequences

2.  De novo detection:
check all triplets in your amplification

Reference Comparison
only as good as the Ref Set
•  Can only find parents if they are in the RefSet

•  Any chimeras in the Ref Set are deleterious!

•  Sparse RefSet may not detect chimeras from closely
related organisms (intra-genera, intra-species)

•  Differential density of the Ref Set can create biases

•  Poor matches to the Ref Set can be mistaken for
chimeras

•  Hard to detect if parents are similar, but may not matter

De Novo Pros and Cons
•  Can detect parents not in the RefSet: novel, close
neighbors, PCR errors, unexpected amplifications

•  Must be run per amplification, i.e., by tube:
all your parents, but only your parents

•  Abundance profile can be tricky with a long tail

•  Early false positives (parent is lost to RefSet)
and false negatives (chimera added to RefSet)
will affect downstream calls

We use both de novo and ref

Rates of Chimera Formation in BPC Datasets
[Figure: percent chimeric for non-unique sequences, as a function of total reads, various datasets. x-axis: Percent of Reads that are Chimeric (0% to 50%); y-axis: Percent of Datasets (0% to 70%). Series: V6V4, V3V5.]

Chimera detection programs
optimized for short reads
•  UChime (in USearch, QIIME and VAMPS)
•  Perseus (in AmpliconNoise and mothur)

Aggregating
Downstream analytical techniques that compensate for
inaccuracies in the remaining sequence data.

Taxonomic assignments will generally remain the same
despite a few mismatches. More so at coarser
taxonomic levels (class vs. genus)

OTU Clustering can round out small percentages of
errors depending on the algorithm used. Clustering at
3% can (but does not always!) aggregate sequences
with 1 – 2% errors.

“Aggregating” is not accepted terminology in the field

Taxonomic Filtering
In addition to providing the knowledge base associated with taxonomic
names:
•  Can filter many unintended PCR amplification products.
•  Reads too far from the tree can be classified as
“Unknown” and examined further.
•  Important to map reads to all domains, not just Bacteria:
primers can amplify across domains and organelles.

Amplification of other Domains

SSU region   Total Reads   Archaea   Bacteria   Organelle   Unknown
V6           529,359       0.02%     96%        4%          0.1%
V6-V4        3,437,855     0.3%      87%        8%          4%

Samples from Little Sippewissett Marsh.
Organelles include mitochondria and chloroplasts

Non SSU rRNA Amplification

[Figure labels: conserved inner membrane protein; cardiolipin synthase; DNA binding transcriptional dual regulator, tyrosine-binding; predicted antibiotic transporter; putative transport system permease protein; predicted major pilin subunit; 16S rRNA (shown twice).]

Thank you, Hilary

Taxonomy

GAST:
Global Alignment of Sequence Taxonomy
Use sequence alignment to compare against a RefSet
Distance = alignment distance to nearest RefSet sequence
(SILVA, Greengenes, Stajich Refs, UNITE, HOMD, etc)
(VAMPS)

RDP:
Ribosomal Database Project
Uses k-mer matching to find nearest genus
Bootstrap values reflect confidence in the assignment
(RDP Training set, Greengenes, etc.)
(QIIME, VAMPS)

Sources of Error
in Taxonomic Analyses

•  Primer bias
•  Chimeras
•  Discovery of novel 16S
•  Unrepresented in reference database
•  Low-quality references
•  Taxonomy not available
•  Incorrect taxonomy in RefSet
•  Ambiguous hypervariable sequence (>1 hit)
•  RefSets often biased toward most studied

Creating OTUs:

Operational Taxonomic Units
for taxonomy independent analyses

OTUs vs Taxonomy
•  Novel organisms

•  Many unnamed organisms

•  Some clades only defined to phyla or class

•  Many species names based on phenotype rather than
genotype

•  Do not lump together all 16S “unknowns” or diverse,
partially classified sequences.

Clustering Algorithms

Different clustering algorithms
can have very different effects on the
size and number of OTUs created…

Clustering Methods
De novo (open)
•  greedy clusters - test sequentially and incorporate
sequence into first qualifying OTU. Dependent on input
order.
•  average linkage - the average distance from a sequence
to every other sequence in the OTU is less than the
width. Dependent on input order.
[complete and single linkage are other methods]
Reference (closed)
•  greedy - map each sequence to representative
sequences defining prebuilt clusters
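
A minimal sketch of greedy de novo clustering, to make the input-order dependence concrete. It assumes equal-length, pre-aligned tags and a simple mismatch-fraction distance; the function names and the toy sequences are hypothetical.

```python
def identity_distance(a, b):
    """Fraction of mismatching positions; assumes equal-length, aligned tags."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, width=0.03, distance=identity_distance):
    """Assign each sequence to the first OTU whose seed is within `width`.

    Returns a list of OTUs, each a list of sequences with the seed first.
    """
    otus = []
    for s in seqs:
        for otu in otus:
            if distance(s, otu[0]) <= width:
                otu.append(s)
                break
        else:
            # No qualifying OTU: the sequence seeds a new one
            otus.append([s])
    return otus

# Three toy 100-nt tags: s2 is 3% from s1, s3 is 3% from s2 and 6% from s1
s1 = "A" * 100
s2 = "C" * 3 + "A" * 97
s3 = "C" * 6 + "A" * 94
print(len(greedy_cluster([s1, s2, s3])))  # 2 OTUs
print(len(greedy_cluster([s2, s1, s3])))  # 1 OTU
```

The same three sequences give two OTUs or one, depending only on which sequence is seen first and becomes the seed.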

The Problem of OTU Inflation
De novo clustering algorithms return more OTUs than
predicted for mock communities.

OTU inflation leads to:
•  alpha diversity inflation
•  beta diversity inflation

Where does this inflation come from?
•  residual sequencing errors,
•  chimeras,
•  multiple sequence alignments,
•  clustering algorithms

Rarefaction, Sample Size
under OTU Inflation
[Figure: rarefaction curves, M2FN PML MS-CL - PML. x-axis: Number of Sequences Sampled (0 to 120,000); y-axis: OTUs (0 to 7,000). Curves for sample sizes 5K, 10K, 15K, 20K, 50K, 100K.]

Rarefaction, Sample Size
with minimal OTU Inflation
[Figure: rarefaction curves, PML SLP-PW-AL.]

Cluster to Reference

1.  Create a comprehensive set of Cluster
Representatives (e.g., new Greengenes) representing
the breadth of Bacteria
2.  Assign each sequence to a ClusterRep within width W
(distance <= W)
3.  If a sequence is not a member of any cluster, set it aside
4.  Cluster de novo the set of extra-cluster sequences
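
Steps 2 and 3 can be sketched as follows, assuming the cluster representatives are already available as a dictionary and that some distance function is supplied by the caller. All names here are hypothetical; step 4 would pass the returned leftovers to whatever de novo method is in use.

```python
def cluster_to_reference(seqs, cluster_reps, width, distance):
    """Closed-reference assignment; unmatched reads are set aside.

    cluster_reps: dict of cluster ID -> representative sequence.
    distance: any function returning a distance between two sequences.
    Returns (assignments, leftovers): cluster ID -> member list, and the
    sequences that matched no representative within `width`.
    """
    assignments = {}
    leftovers = []
    for s in seqs:
        # Step 2: nearest representative, accepted only if within W
        best_id = min(cluster_reps,
                      key=lambda cid: distance(s, cluster_reps[cid]))
        if distance(s, cluster_reps[best_id]) <= width:
            assignments.setdefault(best_id, []).append(s)
        else:
            # Step 3: not a member of any cluster, set aside
            leftovers.append(s)
    return assignments, leftovers
```

Because the cluster IDs come from the reference rather than from the data, they stay the same when new samples are added.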

Advantages of
clustering to full-length reference

•  Not as prone to OTU inflation
•  Can add new data as available
•  Provides static Cluster IDs
–  Can be used to compare short reads from
different regions (v3-v5 and v6)
–  Can compare with other projects using same Ref
Set

Oligotyping
•  Further differentiation within closely related organisms
(e.g., genus)
•  Rather than blanket 3% clustering, select sequence
positions with the most information (Shannon Entropy)
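
A minimal sketch of the entropy step, assuming equal-length aligned reads. It computes the Shannon entropy of each alignment column and picks the most variable positions; joining the bases at those positions into a short label is shown as one plausible way to define an oligotype. Function names are hypothetical and this is not the oligotyping software itself.

```python
from collections import Counter
from math import log2

def column_entropies(aligned_seqs):
    """Shannon entropy (bits) of each column of equal-length aligned reads."""
    entropies = []
    for column in zip(*aligned_seqs):
        total = len(column)
        entropies.append(-sum((c / total) * log2(c / total)
                              for c in Counter(column).values()))
    return entropies

def most_informative_positions(aligned_seqs, k):
    """Indices of the k highest-entropy columns, in alignment order."""
    h = column_entropies(aligned_seqs)
    top = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    return sorted(top)

def oligotype(seq, positions):
    """Label a read by its bases at the selected positions."""
    return "".join(seq[i] for i in positions)
```

Columns that never vary have zero entropy and contribute nothing, so only the positions that actually discriminate between reads are used.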
[Figure: Fusobacterium oligotypes across oral sites: supragingival plaque, subgingival plaque, hard palate, keratinized gingiva, buccal mucosa, tongue dorsum, tonsils, saliva, throat.]

“But I’m not interested in the
rare biosphere,
only the major players.

Can’t I just remove the low
abundance OTUs?”

[Figure: rank-abundance curves. x-axis: OTU Rank; y-axis: Count in OTU. Annotations: a small number of highly abundant organisms; a large number of low abundance organisms (Rare Biosphere).]

Consistent community profile
across samples and environments
Sogin et al, 2006. Microbial diversity in the deep sea and the underexplored “rare biosphere” PNAS 103: 12115-12120

Distribution of OTU relative abundances
across 210 HMP stool samples

Huse et al. (2012) PLoS ONE

Distribution of OTU Absolute Abundances
in English Channel Water Samples
[Figure: x-axis: Frequency in PML Samples (Absent, Singleton, Doubleton, 3-5, 6-10, 11-50, 51-500, >500); y-axis: OTUs.]

Everything may not be everywhere,

but everything is rare somewhere!

If you feel you must remove low abundance OTUs,
don’t do it until you have clustered
ALL of your samples

Alpha and Beta Diversity:

Impacts of Sampling Depth
and Diversity Algorithm

Alpha Diversity - Richness
[Figure: richness estimates (0 to 1,800) vs. sampling depth (0 to 40,000). Series: CL - ACE, SLP - ACE, CL - Chao, SLP - Chao. Annotations: 1 in 5000, 1 in 2500, 1 in 1000, 1 in 500.]

Alpha diversity metrics are sensitive to cluster
method, sequencing depth and rare OTUs

Sampling Depth and Alpha Diversity
[Figure: x-axis: Sampling Depth (0 to 45,000); y-axis: Diversity (0 to 5). Series: SLP - NPShannon, SLP - Simpson, CL - NPShannon, Simpson.]

Robust to both singletons and depth

Comparing Different
Sampling Depths
The “population” is a set of 50,000 reads from one sample

The “samples” are randomly-selected subsets of sizes:
1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000

Calculate diversity estimates at each subsample depth; all
subsamples represent the same population.
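
The subsampling experiment can be sketched as below, assuming the population is available as a list with one OTU label per read. The sketch uses the plain Shannon and Simpson indices rather than the NPShannon estimator named in the figures; function names and the seed are hypothetical.

```python
import random
from collections import Counter
from math import log

def shannon(counts):
    """Shannon index H' (natural log) from a list of OTU counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Simpson index 1 - sum(p_i^2) from a list of OTU counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def subsample_diversity(otu_labels, depths, seed=0):
    """Subsample reads without replacement at each depth.

    Returns one row per depth: (depth, observed OTUs, Shannon, Simpson).
    """
    rng = random.Random(seed)
    rows = []
    for depth in depths:
        counts = list(Counter(rng.sample(otu_labels, depth)).values())
        rows.append((depth, len(counts), shannon(counts), simpson(counts)))
    return rows
```

Comparing the rows shows which estimates drift with depth and which stay stable for the same underlying population.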

Community Distance of Subsamples
[Figure: community distance (0 to 0.12) across replicates. Series: Bray Curtis (1K), Bray Curtis (5K), Morisita Horn (1K), Morisita Horn (5K).]

Subsample 1,000 and 5,000 reads from a sample of 50,000 reads;
pairwise distances for replicates at a single depth

Effect of Sample Depth - Bray Curtis
[Figure: pairwise Bray Curtis distances (0 to 1) between subsamples of depths 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000. Annotation: Nearly 100% Different.]

Bray Curtis uses absolute counts, so
intra-community distances are high as depths diverge

Effect of Sample Depth - Morisita Horn
[Figure: pairwise Morisita Horn distances (0 to 0.009) between subsamples of depths 1,000 to 25,000. Annotation: Nearly 0.5% Different.]

Morisita Horn is a beta diversity metric that uses relative abundances and
compensates for different sample sizes.
Distances are low across depths above the minimum sampling depth.
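
The contrast between the two metrics can be reproduced with a small example. The sketch below uses the standard count-based Bray Curtis dissimilarity and the Morisita Horn dissimilarity; the counts are hypothetical and represent one community profile observed at two depths.

```python
def bray_curtis(x, y):
    """Bray Curtis dissimilarity on raw counts (x, y: equal-length lists)."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))

def morisita_horn(x, y):
    """Morisita Horn dissimilarity; counts are scaled by sample totals."""
    X, Y = sum(x), sum(y)
    dx = sum(a * a for a in x) / X ** 2
    dy = sum(b * b for b in y) / Y ** 2
    cross = sum(a * b for a, b in zip(x, y))
    return 1 - 2 * cross / ((dx + dy) * X * Y)

# Same community profile sampled at two depths (hypothetical counts)
shallow = [50, 30, 20]
deep = [c * 25 for c in shallow]
print(bray_curtis(shallow, deep))    # large, driven by depth alone
print(morisita_horn(shallow, deep))  # approximately zero
```

With identical relative abundances, Bray Curtis still reports a distance above 0.9 because of the 25-fold difference in depth, while Morisita Horn reports essentially no difference.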

SLP Clustering and Bray-Curtis
[Figure: PCoA, PC 1 (-0.4 to 0.8) vs. PC 2 (-0.3 to 0.4). Points labeled by depth: 1,000; 2,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000; 30,000; 40,000.]

Bray-Curtis PCoA clusters entirely on depth
(each point represents 10 atop one another)

SLP Clustering with Morisita Horn
[Figure: PCoA, PC 1 (-0.015 to 0.01) vs. PC 2 (-0.012 to 0.01). Points labeled by depth: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000.]

The minimum sample depth here is 10,000,
but in general it will be a function of the diversity of the sample

Acknowledgements

The Josephine Bay Paul Center
for Comparative Molecular Biology and Evolution

Mitch Sogin
Andy Voorhis
Anna Shipunova
David Mark Welch
A. Murat Eren
Hilary Morrison
Joe Vineis
Sharon Grim

Why filter infrequent errors?

Ns          Average Error Rate   Errors / 400nt   Percent of Reads
0 or more   0.40%                1.6              100%
0           0.40%                1.6              99.3%

If we include all reads, with or without Ns,
we have an overall error rate of 0.4%.
If, however, we remove the <1% of sequences with Ns,
we still have an overall error rate of 0.4%.
Why bother??

Why filter infrequent errors?
Ns   Average Error Rate   Errors / 400nt   Percent of Reads
0    0.40%                1.6              99.3%
1    1.11%                3.1              0.57%
2    3.81%                8.7              0.1%
3    7.26%                16.5             0.0%
4    8.40%                19.2             0.0%
5    10.46%               25.1             0.0%
It’s not just improving the overall error rate,
but removing spurious data
Low-quality reads can be interpreted as unique organisms:
0.7% of 500,000 reads = 3,500 “unique organisms”

454 Error Distribution
Distribution of errors in short reads (<100nt)

Most reads contain
no errors at all

454 errors are not evenly distributed among reads:
many reads have only a small number of errors, and
a small number of reads have many errors.

A good beginning
can mask a bad end
For a 450 nt read whose leading bases average 35:

if last 50 have an average of 0
avg qual = ((400*35) + (50*0)) / 450
= 31
if last 100 have an average of 25
avg qual = ((350*35) + (100*25)) / 450
= 33
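
The same arithmetic in code, together with a sliding-window check that does expose the bad end. This is a sketch; the function names and the 50-base window are assumptions for the example.

```python
def mean_quality(quals):
    """Average Phred score over the whole read."""
    return sum(quals) / len(quals)

def min_window_quality(quals, window=50):
    """Lowest mean quality over any run of `window` consecutive bases."""
    if len(quals) <= window:
        return mean_quality(quals)
    current = sum(quals[:window])
    lowest = current
    for i in range(window, len(quals)):
        # Slide the window one base to the right
        current += quals[i] - quals[i - window]
        lowest = min(lowest, current)
    return lowest / window

quals = [35] * 400 + [0] * 50          # the first case above
print(round(mean_quality(quals), 1))   # 31.1
print(min_window_quality(quals))       # 0.0
```

The read passes an average-quality threshold of 30 while its last 50 bases carry no information at all, which is why a window threshold is listed alongside the average in the filtering recommendations.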

Longer reads,
pushing the limits

454 Filter Summary

                     Percent    Average      Average Errors /
                     of Reads   Error Rate   400 nt
N=0                  99%        0.40%        1.6
N>=1                 1%         0.91%        3.6

Exact Primer         95%        0.38%        1.5
Not Exact Primer     5%         0.84%        3.4

Average Qual >=30    98%        0.90%        3.6
Average Qual <30     2%         1.3%         5.2

454 Filter Summary (cont)

                               Percent    Average      Average Errors /
                               of Reads   Error Rate   400 nt
Read Length (500 - 600nt)      99+%       0.39%        1.6
Read Length (<500, >600 nt)    0.1%       1.8%         7.2

Filtered                       93%        0.36%        1.4
Unfiltered                     7%         0.64%        2.6

Evaluating Chimeras (USearch)
Parent A
Query
Parent B

Diffs: A,B: Q matches expected P a,b: Q matches other P p: A=B!=Q
Votes: + for Model, 0 neutral, ! against Model
Model: shows extent of Parent A and Parent B, xxxx is overlap matching A&B

Initial Length: 277
[Figure: BLAST results graphic. Annotations: extent of your sequence; extent of your match; click on the bar to see the alignment.]

Check for left and right parents:
BLAST the left (1 - 175): 100% match to Fusobacterium
BLAST the right (175 - 277): 100% match to Pseudomonas

Taxonomic Names

•  Bergey’s Taxonomic Outline – manual of
taxonomic names for bacteria
•  List of Prokaryotic names with Standing in
Nomenclature (vetting process)
•  NCBI – similar taxonomy, but multiple
“subs” (subclass, suborder, subfamily, tribe)
•  Archaea – a work in progress…
•  Fungi – another work in progress…

Cluster “Width”
Diameter: sequences are never more than D apart. (CL)
Radius: sequences are never more than R from seed. (SL, AL, Gr)

Average Linkage
collapses errors

[Figure: Cluster Count: 1]

Clusters tend to be heavily dominated by their most abundant
sequence, which strongly weights the average and smooths
the noise.

Still lose outlier
sequencing errors

Multiple sequencing errors still not clustered

Inflation in Action:
Multiple Sequence Alignment
and Complete Linkage clustering

1,042 is a few more
than the expected 2

Example MSA

Regardless of clustering algorithm,
an MSA cannot fully align tags whose
sequences are too divergent

18,156 sequences and 392 positions

Relative Inflation

The absolute number of errant OTUs will increase with sample
size.

The relative number of errant OTUs will decrease with sample
complexity.

The
Magical 3%

3% SSU OTUs = Species
and
6% SSU OTUs = Genera

NOT!

Clustering Questions

•  How meaningful are clusters functionally?
•  When is a rare sequence truly rare and when is it an error?
•  Should it be included in an existing cluster or start its
own?
•  How to place sequences if OTUs overlap?
•  What is the effect of residual low quality data or
chimeras?
•  How sensitive are alpha and beta diversity estimates to
clustering results?
