Marker Gene Analysis
   Best Practices



           Susan Huse
   Marine Biological Laboratory /
         Brown University
        October 17, 2012
Cleaning Data
Filtering:
     Remove reads that are likely to be overall low-quality and have
     errors throughout the read.

Quality Trimming:
   Trim nucleotides from the end(s) of the read based on local
   quality values.

Denoising:
   Adjust nucleotides that are more likely to be an error in base-calling
   (noise) than a true low-frequency variation (signal).

Anchor Trimming:
   Trim the end of long amplicons to a conserved location in the SSU
   alignment.

Chimera Removal:
   Remove hybrid sequences created during amplification.
Recommended 454 Filtering
•  Exact match to barcode and proximal primer
•  Optional denoising (currently only 454)
•  Remove sequences
   –  with Ns
   –  that are too short
   –  Below average or window quality threshold
•  Trim to distal primer or anchor
   –  Remove sequences without anchor / primer
SSU rRNA
             Anchor Trimming
Next-gen sequences often do not reach to the distal primer,
  and reads may have a range of lengths.

De novo OTU clustering and other sequence comparisons
  are more consistent if all tags are trimmed to the same
  start and stop positions in the rRNA alignment.

Anchor trimming uses a highly conserved location situated
  within the read length and truncates all reads to that
  position. Be careful that the anchor is unique and
  present across all taxa.
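
Anchor trimming as described above can be sketched in a few lines. This is a simplified illustration, not the pipeline's actual code: the function name `anchor_trim` is hypothetical, the anchor is matched exactly (real pipelines may allow fuzzy matches), and keeping the anchor itself in the trimmed read is an assumption.

```python
def anchor_trim(seq, anchor):
    """Truncate a read at the end of a conserved anchor.

    Returns None when the anchor is missing or occurs more than once,
    since the anchor must be unique and present in every read.
    """
    first = seq.find(anchor)
    if first == -1:
        return None  # no anchor: the read is removed
    if seq.find(anchor, first + 1) != -1:
        return None  # anchor is not unique in this read
    # Keep everything up to and including the anchor (an assumption).
    return seq[:first + len(anchor)]


trimmed = anchor_trim("ACGTTTGGACCAAA", "GGACC")
```

All reads that pass then stop at the same alignment position, which is what makes downstream sequence comparisons consistent.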
An Illumina HiSeq Error Distribution
[Figure: Quality Scores for Error Positions, untrimmed data. x-axis: Quality Score (0 to 40); y-axis: Cumulative Percent of Errors (0% to 100%). Annotation: 80% of error bases have a quality score <= 16.]
Before trimming, most errors have low Q scores
HiSeq Reads with Ns
 NTAGCACCAAACATAAATCACCTCACTTAAGTGGCTGGAGACAAATAATCTCTTTAATAACCTGATTCAGCGAAACCAATCCGCGGCATTTAGTAGCGGTA!
 NTAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATG!
 NGCGCCAATATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTGATGAACTAAGTCAACCTCAGCACTAACCTTGCGAGTCATTTCTTTGATTTGGTCAT!
 NGTAAAAATGTCTACAGTAGAGTCAATAGCAAGGCCACGACGCAATGGAGAAAGACGGAGAGCGCCAACGGCGTCCATCTCGAAGGAGTCGCCAGCGATAA!
 NTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTC!
 CAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTNTNNNNNAATNNNNNNNNNNNNNNNNNNNNNNNCANNNNNTNGNNNNANNNNNTTGAGTGTGAGGT!
 CGGATTGTTCAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTNAACANNNNNNNNNNNNNATAGTAATCCACGCTCTTNTAANATGTCAACAAGAG!
 TATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGAATTTTACCAATGACCANNNCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAG!
 TAGAAGTCGTCATTTGGCGAGAAAGCTCAGTCTCAGGAGGAAGCGGAGCAGTCCAAANNNTTTTGAGATGGCAGCAACGGAAACCATAACGAGCATCATCT!
 TGCTGTTGAGTGGTCTCATGACAATAAAGTATGTCNCTGNNTTGAAGNNTNNNNNNNNNNNNNNNCTNATACAATCACGCNCANNNNNAAAAGTGTCGTGT!
 CTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGNCTTANNNNNNNNNNNNTGGCGACCCTGTTTTGTATGGCANCTTGCCGCCGCGT!
 CGGCAGAAGCCTGAATGAGCTTAATAGAGGCCAAAGCGGTCTGGAAACGTACGGATTNNNNAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTGAA!
 GTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGNTGGTNNCNNNNNNNNNAAATTGTTTGGAGGCGGTCAAAANGCCGCCTCCGGTG!
 ATATCAACCACACCAGAAGCAGCATCAGTGACGACATTAGAAATATCCTTTGNAGTNNNNNNNNTATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTG!


In this dataset:
    •  68 reads contained at least 1 N, of these:
    •  14 (21%) could not be mapped to PhiX,
    •  7 of those 14 (50%) had only 1 N
    •  24 (35%) contain more than 1 N
                                                                                                 Illumina
Minoche Filtering for Illumina
  Table 2: Expected error rates based on Q-scores
           (% of bases lost)


  No filter

   Illumina Chastity (ChF)

   Low-Quality (B) tails

   Ns
   <1/3 of nt Q<30 in 1st half

   avgQ < 30 1st 30% of nt

   All filters

Minoche A, et al. 2011. Genome Biology 12: R112
using Beta vulgaris, Arabidopsis thaliana, and PhiX
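
Expected error rates follow from Q-scores through the standard Phred definition, Q = -10 * log10(P). A minimal sketch of that conversion is below; the function names are illustrative and are not taken from Minoche et al.

```python
def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def expected_errors(quals):
    """Expected number of errors in a read, summed over its Q-scores."""
    return sum(error_probability(q) for q in quals)


# A Q30 base has a 1 in 1,000 chance of being wrong; Q20 is 1 in 100.
p30 = error_probability(30)
p20 = error_probability(20)
```

Under this definition, a quality score of 16 corresponds to an error probability of about 2.5% per base.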
Remaining Errors
[Figure: Quality Scores for Error Positions, trimmed vs. untrimmed data. x-axis: Quality Score (0 to 40); y-axis: Pct of Errors (0% to 100%). Annotation: PCR errors?]
                                                                                        Illumina
QIIME Illumina Pipeline


•  Single mismatch to barcode

•  Trim read to last position above quality threshold q

•  Remove sequences less than length threshold p

•  Remove sequences with more than n Ns
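
The four steps above can be sketched roughly as follows. This is an illustration of the logic only, not QIIME's code: `filter_read` and its default thresholds are hypothetical, q, p, and n are named after the parameters on the slide, and "last position above q" is read literally as the last base whose score exceeds q.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def filter_read(seq, quals, barcodes, q=20, p=75, n=0):
    """Return (barcode, trimmed sequence), or None if the read is rejected.

    seq      -- read sequence, starting with its barcode
    quals    -- integer Phred scores, one per base of seq
    barcodes -- expected barcode strings, all the same length
    """
    barcodes = list(barcodes)
    bc_len = len(barcodes[0])
    if len(seq) < bc_len:
        return None
    # 1. Allow at most a single mismatch to exactly one known barcode.
    hits = [bc for bc in barcodes if hamming(bc, seq[:bc_len]) <= 1]
    if len(hits) != 1:
        return None
    insert, insert_q = seq[bc_len:], quals[bc_len:]
    # 2. Trim to the last position whose quality is above q.
    keep = 0
    for i, score in enumerate(insert_q):
        if score > q:
            keep = i + 1
    insert = insert[:keep]
    # 3. Remove sequences shorter than the length threshold p.
    if len(insert) < p:
        return None
    # 4. Remove sequences with more than n Ns.
    if insert.count("N") > n:
        return None
    return hits[0], insert
```

Barcodes must differ from each other by at least three positions for the single-mismatch rule to be unambiguous, which is why the sketch rejects reads that match more than one barcode.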
Paired-End Filtering
 A small insert size allows for sequence overlap

Read 1 (forward)
             Area of sequence overlap
                                         Read 2 (reverse)

 Keep only reads that match exactly throughout the region of
 overlap.
 Amplicons designed to completely overlap (e.g., V6) ensure
 the highest quality sequences.
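
A minimal sketch of the exact-overlap rule is shown below. It assumes read 2 is given as sequenced (so it must be reverse-complemented), that the insert is at least as long as each read, and that the function name `merge_exact_overlap` and the `min_overlap` default are illustrative choices rather than part of any named tool.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_exact_overlap(read1, read2, min_overlap=10):
    """Merge a read pair only if the overlap region matches exactly.

    Tries the longest possible overlap first; returns the merged
    sequence, or None if no exact overlap of min_overlap bases exists.
    """
    rc2 = reverse_complement(read2)
    longest = min(len(read1), len(rc2))
    for size in range(longest, min_overlap - 1, -1):
        if read1[-size:] == rc2[:size]:
            return read1 + rc2[size:]
    return None


# Completely overlapping reads (as for V6) merge back to read 1 itself.
v6_like = "ACGTTGCAGGATCC"
merged = merge_exact_overlap(v6_like, reverse_complement(v6_like))
```

Because the search accepts any exact overlap of at least `min_overlap` bases, a very small `min_overlap` could let a pair with a true mismatch pass on a shorter chance overlap; when the expected overlap length is known, comparing that fixed region directly is stricter.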
But Variation Still Exists
[Figure: sequence logo (weblogo.berkeley.edu) of E. coli K-12 V6 paired-end reads with complete perfect overlap; positions 1 to 59, 5′ to 3′.]

     Is this:
     1.  systematic bidirectional sequencing error (unlikely)
     2.  PCR error, or
     3.  natural variation?
What are Chimeras
       and
How do we find them?
1.  PCR primer anneals to the complementary target.
2.  Extension creates a double-stranded amplicon.

But…

3.  Premature dissociation terminates elongation.
4.  The incomplete strand binds to a different template at a conserved
    region…
5.  …then extends to create a chimera.
6.  The chimera can act as a template during the next PCR round.
Chimera Detection
1.  Look for the best match to the left (left parent)
      Parent A
 Chimeric Read



2.  Look for the best match to the right (right parent)
 Chimeric Read
      Parent B



3.  Compare the distance between the two parents – are they
    really different or multiple entries for the same organism
      Parent A
      Parent B
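
The three steps above can be sketched as a toy two-parent check. This is not UChime or Perseus: it assumes the query and all candidate parents are already aligned to the same length with no gaps, and the name `check_chimera` and the thresholds `min_parent_diff` and `min_gain` are hypothetical.

```python
def mismatches(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def check_chimera(query, references, min_parent_diff=3, min_gain=3):
    """Return (is_chimera, left_parent, right_parent, breakpoint).

    references -- dict of name -> sequence, same length as query
    """
    # How well does the best single parent explain the whole read?
    best_single = min(mismatches(query, r) for r in references.values())
    best = None
    for bp in range(1, len(query)):
        # Step 1: best match to the left of the breakpoint.
        left = min(references,
                   key=lambda n: mismatches(query[:bp], references[n][:bp]))
        # Step 2: best match to the right of the breakpoint.
        right = min(references,
                    key=lambda n: mismatches(query[bp:], references[n][bp:]))
        cost = (mismatches(query[:bp], references[left][:bp])
                + mismatches(query[bp:], references[right][bp:]))
        if best is None or cost < best[0]:
            best = (cost, bp, left, right)
    cost, bp, left, right = best
    # Step 3: are the two parents really different from each other?
    parent_diff = mismatches(references[left], references[right])
    is_chimera = (left != right
                  and parent_diff >= min_parent_diff
                  and best_single - cost >= min_gain)
    return is_chimera, left, right, bp
```

The `min_gain` test asks that the two-parent model explain the read clearly better than any single parent, so that a read with one or two sequencing errors is not flagged.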
Detection methods
        differ by source of parents

1.    Reference Comparison:
      check against known reference sequences



2.    De novo detection:
      check all triplets in your amplification
Reference Comparison
      only as good as the Ref Set
•  Can only find parents if they are in the RefSet

•  Any chimeras in the Ref Set are deleterious!

•  Sparse RefSet may not detect chimeras from closely
   related organisms (intra-genera, intra-species)

•  Differential density of the Ref Set can create biases

•  Poor matches to the Ref Set can be mistaken for
   chimeras

•  Hard to detect if parents are similar, but may not matter
De Novo Pros and Cons
•  Can detect parents not in the RefSet: novel, close
   neighbors, PCR errors, unexpected amplifications

•  Must be run per amplification, i.e., per tube:
   all your parents, but only your parents

•  Abundance profile can be tricky with a long tail

•  Early False Positives (parent is lost to RefSet)
   and False Negatives (chimera is added to RefSet)
   will affect downstream calls



       We use both de novo and ref
Rates of Chimera Formation in BPC Datasets
[Figure: Percent Chimeric for non-unique sequences, as a function of total reads, various datasets. x-axis: Percent of Reads that are Chimeric (0% to 50%); y-axis: Percent of Datasets (0% to 70%). Series: V6V4, V3V5.]
Chimera detection programs
    optimized for short reads
•  UChime (in USearch, QIIME and VAMPS)
•  Perseus (in AmpliconNoise and mothur)
Aggregating
       Downstream analytical techniques that compensate for
       inaccuracies in the remaining sequence data.

       Taxonomic assignments will generally remain the same
       despite a few mismatches. More so at coarser
       taxonomic levels (class vs. genus)

       OTU Clustering can round out small percentages of
       errors depending on the algorithm used. Clustering at
       3% can (but does not always!) aggregate sequences
       with 1 – 2% errors.


“Aggregating” is not accepted terminology in the field
Taxonomic Filtering
In addition to the knowledge base associated with taxonomic
names:
•    Can filter many unintended PCR amplification products.
•    Reads too far from the tree can be classified as
     “Unknown” and examined further.
•    Important to map reads to all domains, not just Bacteria;
     primers can amplify across domains and organelles.
Amplification of other Domains

SSU region   Total Reads   Archaea   Bacteria   Organelle   Unknown
V6               529,359   0.02%     96%        4%          0.1%
V6-V4          3,437,855   0.3%      87%        8%          4%

Samples from Little Sippewissett Marsh.
Organelles include mitochondria and chloroplasts.
Non SSU rRNA Amplification
[Figure: gene map of amplified regions. Labeled features: 16S rRNA (labeled twice); predicted major pilin subunit; conserved inner membrane protein, cardiolipin synthase; DNA binding transcriptional dual regulator, tyrosine-binding; putative transport system permease protein; predicted antibiotic transporter.]

                                                                            Thank you, Hilary
Taxonomy

GAST:
  Global Alignment for Sequence Taxonomy
  Use sequence alignment to compare against a RefSet
  Distance = alignment distance to nearest RefSet sequence
  (SILVA, Greengenes, Stajich Refs, UNITE, HOMD, etc)
  (VAMPS)

RDP:
  Ribosomal Database Project
  Uses k-mer matching to find nearest genus
  Bootstrap values reflect confidence in the assignment
  (RDP Training set, Greengenes, etc.)
  (QIIME, VAMPS)
Sources of Error
      in Taxonomic Analyses

•    Primer bias
•    Chimeras
•    Discovery of novel 16S
•    Unrepresented in reference database
•    Low-quality references
•    Taxonomy not available
•    Incorrect taxonomy in RefSet
•    Ambiguous hypervariable sequence (>1 hit)
•    RefSets often biased toward most studied
Creating OTUs:

    Operational Taxonomic Units
for taxonomy independent analyses
OTUs vs Taxonomy
•  Novel organisms

•  Many unnamed organisms

•  Some clades only defined to phyla or class

•  Many species names based on phenotype rather than
   genotype

•  OTUs do not lump together all 16S “unknowns” or diverse,
   partially classified sequences.
Clustering Algorithms


    Different clustering algorithms
can have very different effects on the
 size and number of OTUs created…
Clustering Methods
De novo (open)
•  greedy clusters - test sequentially and incorporate
   sequence into first qualifying OTU. Dependent on input
   order.
•  average linkage - the average distance from a sequence
   to every other sequence in the OTU is less than the
   width. Dependent on input order.
   [complete and single linkage are other methods]
Reference (closed)
•  greedy - map each sequence to representative
   sequences defining prebuilt clusters
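
Greedy de novo clustering, and its dependence on input order, can be illustrated with a short sketch. It assumes equal-length, pre-aligned sequences and uses the fraction of mismatching positions as the distance; real tools compute alignment-based distances. The names `greedy_cluster` and `fraction_diff` are illustrative.

```python
def fraction_diff(a, b):
    """Fraction of positions that differ between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, width=0.03):
    """Assign each sequence to the first OTU whose seed is within width.

    Returns a list of (seed, members) pairs in order of creation.
    """
    otus = []
    for s in seqs:
        for seed, members in otus:
            if fraction_diff(s, seed) <= width:
                members.append(s)  # first qualifying OTU wins
                break
        else:
            otus.append((s, [s]))  # no OTU qualified: s seeds a new one
    return otus


# Three 100 nt sequences: b is 2% from a and from c; a and c are 4% apart.
a = "A" * 100
b = "C" * 2 + "A" * 98
c = "C" * 4 + "A" * 96
n_otus_abc = len(greedy_cluster([a, b, c]))
n_otus_bac = len(greedy_cluster([b, a, c]))
```

With a first, c falls outside a's OTU and seeds its own; with b first, all three sequences join a single OTU. The same data therefore yields different OTU counts depending only on order.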
The Problem of OTU Inflation
De novo clustering algorithms return more OTUs than
  predicted for mock communities.

OTU inflation leads to:
  •  alpha diversity inflation
  •  beta diversity inflation

Where does this inflation come from?
  •  residual sequencing errors,
  •  chimeras,
  •  multiple sequence alignments,
  •  clustering algorithms
Rarefaction, Sample Size
          under OTU Inflation
[Figure: M2FN PML MS-CL - PML rarefaction curves. x-axis: Number of Sequences Sampled (0 to 120,000); y-axis: OTUs (0 to 7,000). Series by sample size: 5K, 10K, 15K, 20K, 50K, 100K.]
Rarefaction, Sample Size
with minimal OTU Inflation
       PML SLP-PW-AL
Cluster to Reference

1.  Create a comprehensive set of Cluster
    Representatives (e.g., new Greengenes) representing
    the breadth of Bacteria
2.  Assign each sequence to ClusterRep <= W
3.  If Seq is not a member of any cluster, set aside
4.  Cluster denovo the set of extra-cluster sequences
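
The four steps above might look like the following sketch. It assumes the Cluster Representatives have already been trimmed to the same region and length as the reads (in practice short reads would be aligned against full-length references), and it uses a simple greedy pass for the de novo step. All function and variable names are hypothetical.

```python
def fraction_diff(a, b):
    """Fraction of positions that differ between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_to_reference(seqs, cluster_reps, width=0.03):
    """Closed-reference assignment followed by de novo clustering.

    cluster_reps -- dict of static cluster ID -> representative sequence
    Returns (assigned, denovo): assigned maps cluster ID -> member reads;
    denovo is a list of (seed, members) built from the leftover reads.
    """
    assigned, leftovers = {}, []
    for s in seqs:
        # Step 2: find the nearest representative and test it against W.
        rep_id = min(cluster_reps,
                     key=lambda r: fraction_diff(s, cluster_reps[r]))
        if fraction_diff(s, cluster_reps[rep_id]) <= width:
            assigned.setdefault(rep_id, []).append(s)
        else:
            leftovers.append(s)  # Step 3: set aside
    # Step 4: greedy de novo clustering of the extra-cluster sequences.
    denovo = []
    for s in leftovers:
        for seed, members in denovo:
            if fraction_diff(s, seed) <= width:
                members.append(s)
                break
        else:
            denovo.append((s, [s]))
    return assigned, denovo
```

Because the cluster IDs come from the reference set rather than from the data, adding new reads later does not change the assignment of reads that were already clustered.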
Advantages of
 clustering to full-length reference

•  Not as prone to OTU inflation
•  Can add new data as available
•  Provides static Cluster IDs
  –  Can be used to compare short reads from
     different regions (v3-v5 and v6)
  –  Can compare with other projects using same Ref
     Set
Oligotyping
•  Further differentiation within closely related organisms
   (e.g., genus)
•  Rather than blanket 3% clustering, select sequence
   positions with the most information (Shannon Entropy)
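
The idea of selecting the most informative positions by Shannon entropy can be sketched as below. This is a simplified illustration, not the oligotyping software: it assumes aligned reads of equal length, and `column_entropy`, `oligotypes`, and `n_positions` are hypothetical names.

```python
import math
from collections import Counter


def column_entropy(aligned_reads):
    """Shannon entropy (bits) of each alignment column."""
    entropies = []
    for column in zip(*aligned_reads):
        total = len(column)
        counts = Counter(column).values()
        entropies.append(-sum((c / total) * math.log2(c / total)
                              for c in counts))
    return entropies


def oligotypes(aligned_reads, n_positions=2):
    """Label each read by its bases at the highest-entropy positions."""
    ent = column_entropy(aligned_reads)
    ranked = sorted(range(len(ent)), key=lambda i: ent[i], reverse=True)
    top = sorted(ranked[:n_positions])
    labels = Counter("".join(read[i] for i in top) for read in aligned_reads)
    return top, labels


reads = ["ACGT", "ACGA", "TCGT", "TCGA"]
positions, labels = oligotypes(reads, n_positions=2)
```

Columns that never vary have zero entropy and are ignored, so reads are separated only by the positions that actually carry information.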
[Figure: Fusobacterium oligotypes across oral sites: hard palate, keratinized gingiva, buccal mucosa, supragingival plaque, subgingival plaque, tonsils, tongue dorsum, saliva, throat.]
“But I’m not interested in the
       rare biosphere,
  only the major players.

Can’t I just remove the low
   abundance OTUs?”
[Figure: rank-abundance curves (x-axis: OTU Rank; y-axis: Count in OTU) showing a small number of highly abundant organisms and a large number of low abundance organisms (the Rare Biosphere).]

Consistent community profile across samples and environments

Sogin et al. 2006. Microbial diversity in the deep sea and the underexplored “rare biosphere”. PNAS 103: 12115-12120
Distribution of OTU relative abundances
     across 210 HMP stool samples




                               Huse et al. (2012) PLoS ONE
Distribution of OTU Absolute Abundances
          in English Channel Water Samples
[Figure: histogram. x-axis: Frequency in PML Samples (Absent, Singleton, Doubleton, 3-5, 6-10, 11-50, 51-500, >500); y-axis: OTUs.]
Everything may not be everywhere,

  but everything is rare somewhere!



If you feel you must remove low abundance OTUs,
         don’t do it until you have clustered
                ALL of your samples
Alpha and Beta Diversity:

Impacts of Sampling Depth
  and Diversity Algorithm
Alpha Diversity - Richness
[Figure: richness estimates. x-axis: 0 to 40,000 (unlabeled); y-axis: 0 to 1,800 (unlabeled). Series: CL - ACE, SLP - ACE, CL - Chao, SLP - Chao, 1 in 5000, 1 in 2500, 1 in 1000, 1 in 500.]

 Alpha diversity metrics are sensitive to cluster
  method, sequencing depth and rare OTUs
Sampling Depth and Alpha Diversity
[Figure: x-axis: Sampling Depth (0 to 45,000); y-axis: Diversity (0 to 5). Series: SLP - NPShannon, SLP - Simpson, CL - NPShannon, Simpson.]


                       Robust to both singletons and depth
Comparing Different
           Sampling Depths
The “population” is a set of 50,000 reads from one sample

The “samples” are randomly-selected subsets of sizes:
   1,000           15,000
   5,000           20,000
   7,500           25,000
   10,000

Calculate diversity estimates across subsample
   depths that all represent the same population.
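
The subsampling experiment can be sketched as follows. The population here is synthetic (a hypothetical long-tailed community generated with a fixed seed), and the code uses the plain Shannon and Simpson indices rather than the nonparametric Shannon estimator; function names are illustrative.

```python
import math
import random
from collections import Counter


def shannon(counts):
    """Plain Shannon index (natural log) from OTU counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson diversity, 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def diversity_by_depth(population, depths, seed=0):
    """Subsample the population at each depth and summarize diversity."""
    rng = random.Random(seed)
    results = {}
    for depth in depths:
        sample = rng.sample(population, depth)  # without replacement
        counts = list(Counter(sample).values())
        results[depth] = {"otus": len(counts),
                          "shannon": shannon(counts),
                          "simpson": simpson(counts)}
    return results


# Hypothetical "population": 50,000 reads, each labeled with an OTU id,
# drawn from a long-tailed abundance distribution.
_rng = random.Random(1)
population = _rng.choices(range(500),
                          weights=[1 / (i + 1) for i in range(500)],
                          k=50000)
depths = [1000, 5000, 7500, 10000, 15000, 20000, 25000]
results = diversity_by_depth(population, depths)
```

Comparing `results[d]["otus"]` with the Shannon and Simpson values across depths shows which estimates keep changing as more reads are sampled from the same population.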
Community Distance of Subsamples
[Figure: Community Distance (0 to 0.12) across Replicates. Series: Bray-Curtis (1K), Bray-Curtis (5K), Morisita-Horn (1K), Morisita-Horn (5K).]



Subsample 1,000 and 5,000 reads from sample of 50,000 reads,
      Pairwise distances for replicates at single depth
Effect of Sample Depth - Bray Curtis


[Figure: pairwise Bray-Curtis distances (0.000 to 1.000) between subsamples at depths of 1000, 5000, 7500, 10000, 15000, 20000, and 25000 reads. Annotation: Nearly 100% Different.]

           Bray-Curtis uses absolute counts, so
  intra-community distances are high as depths diverge
Effect of Sample Depth - Morisita Horn



[Figure: pairwise Morisita-Horn distances (0.000 to 0.009) between subsamples at depths of 1,000, 5,000, 7,500, 10,000, 15,000, 20,000, and 25,000 reads. Annotation: Nearly 0.5% Different.]

  Beta diversity metric that uses relative abundances and
          compensates for different sample sizes.
Distances are low across depths above the minimum sampling depth.
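
The contrast between the two metrics can be checked with a small worked example. The sketch below uses the standard Bray-Curtis and Morisita-Horn formulas on OTU count tables stored as dictionaries; the toy counts are made up for illustration.

```python
def bray_curtis(x, y):
    """Bray-Curtis distance on absolute OTU counts."""
    otus = set(x) | set(y)
    shared = sum(min(x.get(o, 0), y.get(o, 0)) for o in otus)
    return 1 - 2 * shared / (sum(x.values()) + sum(y.values()))


def morisita_horn(x, y):
    """Morisita-Horn distance, which normalizes by each sample's size."""
    nx, ny = sum(x.values()), sum(y.values())
    cross = sum(count * y.get(o, 0) for o, count in x.items())
    dx = sum(c * c for c in x.values()) / nx ** 2
    dy = sum(c * c for c in y.values()) / ny ** 2
    return 1 - 2 * cross / ((dx + dy) * nx * ny)


# Same community sampled at 100 reads and at 1,000 reads (toy counts).
shallow = {"otu1": 10, "otu2": 20, "otu3": 70}
deep = {"otu1": 100, "otu2": 200, "otu3": 700}
bc = bray_curtis(shallow, deep)
mh = morisita_horn(shallow, deep)
```

Here the two samples have identical relative abundances, yet Bray-Curtis reports a distance of about 0.82 purely because of the tenfold difference in depth, while Morisita-Horn reports essentially zero.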
SLP Clustering and Bray-Curtis
[Figure: PCoA plot, PC 1 (-0.4 to 0.8) vs. PC 2 (-0.3 to 0.4). Series by sampling depth: 1,000; 2,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000; 30,000; 40,000.]



   Bray-Curtis PCoA clusters entirely on depth
   (each point represents 10 atop one another)
SLP Clustering with Morisita Horn
[Figure: PCoA plot, PC 1 (-0.015 to 0.01) vs. PC 2 (-0.012 to 0.01). Series by sampling depth: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000.]

                  Minimum sample depth here of 10,000,
             but will be a function of the diversity of the sample
Acknowledgements

  The Josephine Bay Paul Center
   for Comparative Molecular Biology and Evolution

Mitch Sogin
Andy Voorhis
Anna Shipunova
David Mark Welch
A. Murat Eren
Hilary Morrison
Joe Vineis
Sharon Grim
Why filter infrequent errors?

   Ns          Average 454    Errors /   Percent of
               Error Rate     400nt      Reads
0 or more      0.40%          1.6        100%
    0          0.40%          1.6        99.3%

If we include all reads, with or without Ns,
        we have an overall error rate of 0.4%.
If, however, we remove the <1% of sequences with Ns,
       we still have an overall error rate of 0.4%.
Why bother??
                                                          454
Why filter infrequent errors?
             Average Error     Errors /   Percent of
    Ns            Rate          400nt      Reads
    0           0.40%            1.6        99.3%
    1            1.11%           3.1        0.57%
    2           3.81%            8.7        0.1%
    3           7.26%            16.5       0.0%
    4           8.40%            19.2       0.0%
    5           10.46%           25.1       0.0%

      It’s not just improving the overall error rate,
                but removing spurious data
Low-quality reads can be interpreted as unique organisms:
   0.7% of 500,000 reads = 3,500 “unique organisms”
454 Error Distribution
[Figure: Distribution of errors in short reads (<100nt). Annotation: Most reads contain no errors at all.]

454 errors are not evenly distributed among reads:
many reads have only a small number of errors, and
a small number of reads have many errors.
                                                          454
A good beginning
        can mask a bad end
For a 450 nt read whose leading bases average 35:

if the first 400 average 35 and the last 50 average 0
       avg qual = ((400*35) + (50*0)) / 450
                 = 31
if the first 350 average 35 and the last 100 average 25
       avg qual = ((350*35) + (100*25)) / 450
                = 33
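
A short sketch shows why a window-based threshold catches what a whole-read average misses. The function names and the 50 nt window are illustrative choices, and the sketch assumes the read is at least one window long.

```python
def mean_quality(quals):
    """Average quality over the whole read."""
    return sum(quals) / len(quals)


def min_window_quality(quals, window=50):
    """Lowest average quality over any window of consecutive bases."""
    running = sum(quals[:window])
    lowest = running
    for i in range(window, len(quals)):
        # Slide the window one base to the right.
        running += quals[i] - quals[i - window]
        lowest = min(lowest, running)
    return lowest / window


# 450 nt read: the first 400 nt average 35, the last 50 nt average 0.
quals = [35] * 400 + [0] * 50
overall = mean_quality(quals)
worst = min_window_quality(quals, window=50)
```

The overall average is about 31, which passes a threshold of 30, while the worst 50 nt window averages 0 and would fail any window-based threshold.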
Longer reads,
pushing the limits
454 Filter Summary
                      Percent      Average       Average Errors
                      of Reads     Error Rate    / 400 nt
N=0                     99%          0.40%          1.6
N>=1                    1%           0.91%          3.6

Exact Primer            95%          0.38%          1.5
Not Exact Primer        5%           0.84%          3.4

Average Qual >=30       98%          0.90%          3.6
Average Qual <30        2%           1.3%           5.2
                                                           454
454 Filter Summary (cont)
                                Percent      Average       Average Errors
                                of Reads     Error Rate    / 400 nt
Read Length (500 - 600 nt)       99+%          0.39%          1.6
Read Length (<500, >600 nt)      0.1%          1.8%           7.2

Filtered                         93%           0.36%          1.4
Unfiltered                       7%            0.64%          2.6
                                                                   454
Evaluating Chimeras (USearch)
[Figure: alignment rows for Parent A, Query, and Parent B]

Diffs: A,B: Q matches expected P
       a,b: Q matches other P
       p:   A=B!=Q
Votes: + for Model,    0 neutral,    ! against Model
Model: shows extent of Parent A and Parent B, xxxx is overlap matching A&B
Initial Length: 277
[Figure: alignment graphic with annotations "Extent of your sequence",
 "Extent of your match", and "Click on the bar to see the alignment"]
Check for left and right parents:
BLAST the left (1-175)
BLAST the right (175 - 277)
Positions 1 - 175:     100% match to Fusobacterium
Positions 175 - 277:   100% match to Pseudomonas
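The left/right parent check can be illustrated with a toy example. This is a sketch, assuming pre-aligned sequences of equal length and a known breakpoint. The sequences are made up, and the helper names are hypothetical rather than part of USearch or BLAST.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def split_identities(query, parent_a, parent_b, breakpoint):
    """Identity of the left segment to parent A and the right segment to parent B."""
    left = identity(query[:breakpoint], parent_a[:breakpoint])
    right = identity(query[breakpoint:], parent_b[breakpoint:])
    return left, right

# Toy aligned sequences (made up for illustration, not real 16S data).
parent_a = "ACGTACGTACGTACGTACGT"
parent_b = "ACGAACCTACTTACGAACCT"
breakpoint = 10
query = parent_a[:breakpoint] + parent_b[breakpoint:]

left, right = split_identities(query, parent_a, parent_b, breakpoint)
whole_a = identity(query, parent_a)
whole_b = identity(query, parent_b)

print(left, right, whole_a, whole_b)
```

Each half matches its own parent perfectly while the full-length query matches neither, which is the pattern the two separate BLAST searches are meant to expose.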
Taxonomic Names

•  Bergey’s Taxonomic Outline – manual of
   taxonomic names for bacteria
•  List of Prokaryotic names with Standing in
   Nomenclature (LPSN) (vetting process)
•  NCBI – similar taxonomy, but multiple
   “subs” (subclass, suborder, subfamily, tribe)
•  Archaea – a work in progress…
•  Fungi – another work in progress…
Cluster “Width”
Diameter (CL):           Sequences are never more than D apart.
Radius (SL, AL, Gr):     Sequences are never more than R from seed.
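The difference between the two definitions of width can be sketched as follows. This is a minimal illustration that assumes equal-length aligned sequences and a simple mismatch fraction as the distance. The function names are hypothetical.

```python
from itertools import combinations

def mismatch_fraction(a, b):
    """Fraction of mismatched positions between equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def within_diameter(cluster, d):
    """True if every pair of sequences in the cluster is at most d apart."""
    return all(mismatch_fraction(a, b) <= d for a, b in combinations(cluster, 2))

def within_radius(cluster, seed, r):
    """True if every sequence in the cluster is at most r from the seed."""
    return all(mismatch_fraction(seed, s) <= r for s in cluster)

# Toy 10 nt sequences: two members each differ from the seed at one position.
seed = "AAAAAAAAAA"
cluster = [seed, "CAAAAAAAAA", "AAAAAAAAAC"]

radius_ok = within_radius(cluster, seed, 0.1)
diameter_ok = within_diameter(cluster, 0.1)
print(radius_ok, diameter_ok)
```

Both variants sit within 10% of the seed, but they are 20% from each other, so the same set of sequences satisfies a radius of 0.1 and fails a diameter of 0.1.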
Average Linkage collapses errors

[Figure: Cluster Count: 1 (cluster #1)]

Clusters tend to be heavily dominated by their most abundant
sequence, which strongly weights the average and smooths the noise.
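A small sketch of why the dominant sequence matters. It assumes an abundance-weighted mean of mismatch fractions as the average-linkage distance and a hypothetical cluster width of 0.06. The sequences and counts are toy values.

```python
def mismatch_fraction(a, b):
    """Fraction of mismatched positions between equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def average_distance(candidate, cluster_counts):
    """Abundance-weighted mean distance from the candidate to all cluster members."""
    total = sum(cluster_counts.values())
    weighted = sum(mismatch_fraction(candidate, seq) * n
                   for seq, n in cluster_counts.items())
    return weighted / total

dominant = "ACGTACGTACGTACGTACGT"   # 20 nt, the abundant true sequence
variant = "TCGTACGTACGTACGTACGT"    # one error at the first position
candidate = "ACGTACGTACGTACGTACGA"  # one error at the last position

cluster = {dominant: 98, variant: 2}
width = 0.06  # hypothetical clustering threshold

avg = average_distance(candidate, cluster)
farthest = max(mismatch_fraction(candidate, seq) for seq in cluster)

print(avg <= width, farthest <= width)
```

The candidate is 5% from the dominant sequence and 10% from the rare variant. The weighted average stays near 5% and admits it, whereas a complete-linkage test on the farthest member would reject it.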
  	
  
Still lose outlier
           sequencing errors




Multiple sequencing errors still not clustered
Inflation in Action:
Multiple Sequence Alignment and Complete Linkage clustering

1,042 OTUs is a few more than the expected 2
Example MSA




  Regardless of clustering algorithm,
 an MSA cannot fully align tags whose
     sequences are too divergent




18,156 sequences and 392 positions
Relative Inflation

Absolute number of errant OTUs will increase with sample size.

Relative number of errant OTUs will decrease with sample complexity.
The Magical 3%

3% SSU OTUs = Species
        and
6% SSU OTUs = Genera


   NOT!
Clustering Questions

•  How meaningful are clusters functionally?
•  When is a rare sequence genuinely rare and when is it an error?
•  Should it be included in an existing cluster or start its
   own?
•  How to place sequences if OTUs overlap?
•  What is the effect of residual low quality data or
   chimeras?
•  How sensitive are alpha and beta diversity estimates to
   clustering results?
