16S classifier

16S Classifier: a tool for fast and accurate
classification of 16S rRNA sequences
Ashok K. Sharma
Research Scholar
Metagenomics and Systems Biology Laboratory
Indian Institute of Science Education and Research, Bhopal

Overview
[Figure labels: What, How, Who; Species Diversity; Species Richness; Metagenome; Arcobacter, Paludibacter, Shewanella, Pseudomonas, Unknown]
• Knowledge of the microbial diversity of soil and other
extreme environments is still limited
• Only 1-3% of soil microbes are culturable
• 1 g of soil is estimated to contain 4,000-5,000
different bacterial “genomic units”
• Bacteria and fungi play an important
role in biogeochemical cycles, and
especially in human health

Methods of studying microbial diversity
Biochemical
• Plate count
• Community level
physiological profiling
• Fatty acid methyl ester
analysis: as fatty acids make
up constant proportion of
cell biomass
Molecular
• G+C content
• Nucleic acid re-association
and hybridization
• DNA microarray
• DNA cloning and
sequencing-based methods

Metagenomic reads vs 16S rRNA for microbial
diversity identification
Metagenomic reads route: Metagenome → DNA isolation → Fragmentation of DNA → Metagenomic reads → Microbial diversity
(Tools: Kraken, PhyloPythiaS, Phymm, PhymmBL, MetaBin)
16S rRNA route: Metagenome → DNA isolation → Amplification of 16S rRNA → 16S rRNA from multiple species → Microbial diversity

16S rRNA – a “gold standard” for microbial
molecular identification
• Universal
• Highly conserved
• Long enough (~1500 bp) to provide significant discrimination
between many species
• Structural information can guide alignment and phylogenetic
reconstruction
• Many species now represented in the database
16S rRNA gene
sequencing
Earlier → by sequencing the whole gene
Now → by sequencing short variable regions
Limitations:
• Insufficient and
underestimated diversity

16S rRNA: to understand microbial diversity
Community composition shifts over time as revealed
by 16S data

Software and tools available for the analysis of
16S rRNA data
• CloVR-16S
• QIIME – a Python-based workflow package, allowing for sequence
processing and phylogenetic analysis using different methods including
the phylogenetic distance metric UniFrac, UCLUST, PyNAST and the RDP
Bayesian classifier;
• Mothur – a C++-based software package for 16S analysis;
• Metastats and custom R scripts used to generate additional statistical and
graphical evaluations.
• Most recent: 16S Classifier – a Random Forest-based standalone package
designed specifically for short hypervariable regions

Material and methods
• Greengenes database
• Random Forest
• EMBOSS
• RDP Classifier
• BLAST

Input Data for Training
In 16S Classifier, we built separate models for the different
hypervariable regions (HVRs) of the 16S rRNA gene
• Started from the Greengenes 16S rRNA database
• Extracted individual HVRs, as well as combinations of two or more commonly
used HVRs, using commonly used universal primers with the help of in-house
Perl scripts and the EMBOSS software suite
• Discarded HVRs for which primer coverage was less than 50% of all
sequences
• Clustered out highly similar sequences using CD-HIT at a threshold of 1
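
The primer-based extraction and the 50% coverage filter described above can be illustrated with a minimal Python sketch. This is not the authors' Perl/EMBOSS pipeline: the function names (`extract_region`, `primer_coverage`) are hypothetical, the primers in the usage note are toy sequences rather than real universal primers, and only exact matches to the (possibly degenerate) primer are considered.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_region(sequence, forward_primer, reverse_primer):
    """Return the sequence between the primer sites, or None if either is missing."""
    sequence = sequence.upper()
    fwd = re.search(primer_regex(forward_primer), sequence)
    if fwd is None:
        return None
    downstream = sequence[fwd.end():]
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    rev = re.search(primer_regex(reverse_complement(reverse_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


def primer_coverage(sequences, forward_primer, reverse_primer):
    """Fraction of sequences from which the region could be extracted."""
    if not sequences:
        return 0.0
    hits = sum(
        extract_region(s, forward_primer, reverse_primer) is not None
        for s in sequences
    )
    return hits / len(sequences)
```

For example, with the toy primers `ACGWT` (forward) and `TTRCC` (reverse), `extract_region("TTTACGATCCCCTTTTGGCAAAAA", "ACGWT", "TTRCC")` returns the region between the two primer sites, and a region whose `primer_coverage` falls below 0.5 would be discarded.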

Table 1. Summary of the number of HVR sequences which were
used for the training and testing of RF*.

Parameter optimization
• Labeled each sequence with its taxonomic information down to the lowest
known level, excluding species
• Used the V3 region for parameter optimization
• Calculated 2-mer, 3-mer, 4-mer, 5-mer and 6-mer nucleotide
frequencies and tried each as feature input
• Tried various mtry values at each k-mer size to find the lowest OOB
(out-of-bag) error
• The best results were obtained at k = 4, so 4-mer nucleotide frequencies
were used for building the models at ntree = 1000.
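
As an illustration of the k-mer features mentioned above, here is a minimal Python sketch that turns a sequence into a fixed-length vector of 4-mer frequencies. The function name `kmer_frequencies` is hypothetical, and two details are assumptions not stated on the slide: counts are normalized by the number of valid windows, and windows containing ambiguous bases (such as N) are skipped.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    The vector has 4**k entries (256 for k = 4), so every sequence maps
    to a feature vector of the same length regardless of its own length.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # assumption: skip windows with ambiguous bases
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

A vector like this, one per training sequence, is the kind of input a Random Forest implementation could consume; the choice of k trades feature count (4^k) against resolution, which is consistent with the slide's search over k = 2 to 6.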

Figure 1. Optimization of parameters using hypervariable region V3

Input data for testing
• The first test dataset was obtained by randomly extracting ~10% of the
sequences that had been clustered out earlier using CD-HIT. Random
mutations were introduced into these sequences at a rate of 1% to mimic
real-life sequencing errors
• The second dataset was obtained from real metagenomic sequences
available from the NCBI SRA
• The performance of 16S Classifier was compared with that of RDP
Classifier in terms of both accuracy and computation time.
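
The 1% mutation step used for the first test dataset can be sketched in Python as follows. The slide does not say which kinds of mutation were introduced, so this sketch assumes substitutions only (no insertions or deletions); the function name `mutate` and its `seed` parameter are hypothetical.

```python
import random


def mutate(sequence, rate=0.01, seed=None):
    """Substitute each A/C/G/T base with a different base with probability `rate`.

    Assumption: only substitutions are simulated, so the sequence length
    is unchanged. Non-ACGT characters are passed through untouched.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    mutated = []
    for base in sequence.upper():
        if base in bases and rng.random() < rate:
            # Pick one of the three other bases so the position really changes.
            mutated.append(rng.choice([b for b in bases if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)
```

Passing a fixed `seed` makes the simulated test set reproducible, which helps when comparing classifiers on exactly the same mutated sequences.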

Performance of different RF models on
different HVRs and complete 16S rRNA gene

Performance of RF models on first test
dataset

Comparison of 16S Classifier with RDP
Classifier on real datasets

Advantages of 16S Classifier
• Extremely fast
• High sensitivity as well as specificity
• Consistent across various HVRs
• Easy availability
• Easy to deploy and use

How to use
• Download the zip file for a particular hypervariable region or for the complete 16S gene,
freely available at
http://metagenomics.iiserb.ac.in/16Sclassifier/download.html
• Extract the zip file, which contains a model file (*.Rdata), a script file (*.sh) and
an executable file (16Sclassifier.exe).
• Other dependencies:
Install R from http://cran.r-project.org/
Install the randomForest R package
## Command line usage ##
./16sclassifier.exe <queryfile> <modelname>
The query file should be in FASTA format, and the model name can be one of v2, v3, v4, v5,
v6, v7, v8, v23, v34, v35, v45, v56, v67, v78 or complete.
