The diversity of microbial species in a metagenomic study is commonly assessed using 16S rRNA gene sequencing. With the rapid developments in genome sequencing technologies, the focus has shifted towards sequencing the hypervariable regions of the 16S rRNA gene instead of the full-length gene. Therefore, 16S Classifier was developed using a machine learning method, Random Forest, for faster and more accurate taxonomic classification of short hypervariable regions of the 16S rRNA sequence. It displayed precision values of up to 0.91 on training datasets and up to 0.98 on the test dataset. On real metagenomic datasets, it showed up to 99.7% accuracy at the phylum level and up to 99.0% accuracy at the genus level. 16S Classifier is freely available at http://metagenomics.iiserb.ac.in/16Sclassifier and http://metabiosys.iiserb.ac.in/16Sclassifier.
16S classifier
1. 16S Classifier: a tool for fast and accurate
classification of 16S rRNA sequences
Ashok K. Sharma
Research Scholar
Metagenomics and Systems Biology Laboratory
Indian Institute of Science Education and Research, Bhopal
3. Methods of studying microbial diversity
Biochemical
• Plate count
• Community level
physiological profiling
• Fatty acid methyl ester
analysis: as fatty acids make
up constant proportion of
cell biomass
Molecular
• G+C content
• Nucleic acid re-association
and hybridization
• DNA microarray
• DNA cloning and
sequencing-based methods
4. Metagenomic reads vs 16S rRNA for microbial
diversity identification
Metagenome
DNA Isolation
Fragmentation
of DNA
Metagenomic Reads
Amplification of
16S rRNA
16S rRNA from multiple species
Microbial diversity
Tools: Kraken, PhyloPythiaS,
Phymm, PhymmBL,
MetaBin
Microbial diversity
5. 16S rRNA – a “gold standard” for microbial
molecular identification
• Universal
• Highly conserved
• Long enough (~1500 bp) to provide significant discrimination
between many species
• Structural information can guide alignment and phylogenetic
reconstruction
• Many species now represented in the database
16S rRNA gene
sequencing
Earlier: by sequencing the whole gene
Now: by sequencing short variable regions
Limitations:
• Insufficient and
underestimated diversity
7. 16S rRNA: to understand microbial diversity
Community composition shifts over time as revealed
by 16S data
8. Software and tools available for the analysis of
16S rRNA data
• CloVR-16S
• QIIME – a Python-based workflow package, allowing for sequence
processing and phylogenetic analysis using different methods including
the phylogenetic distance metric UniFrac, UCLUST, PyNAST and the RDP
Bayesian classifier;
• Mothur – a C++-based software package for 16S analysis;
• Metastats and custom R scripts used to generate additional statistical and
graphical evaluations.
• Most recent: 16S Classifier – Random forest based standalone package
specially for short hypervariable regions
9. Material and methods
• Greengenes database
• Random forest
• EMBOSS
• RDP Classifier
• BLAST
10. Input Data for Training
In 16S Classifier, separate models were built for the different
hypervariable regions (HVRs) of the 16S rRNA gene
Took the Greengenes 16S rRNA database
Extracted individual HVRs, as well as combinations of two or more commonly
used HVRs, using commonly used universal primers with the help of
in-house Perl scripts and the EMBOSS software suite
Discarded HVRs where primer coverage was less than 50% of all
sequences
Clustered out highly similar sequences using CD-HIT at a threshold of 1
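The primer-based extraction and the 50% coverage filter described above can be sketched in Python. This is only an illustration: the original work used in-house Perl scripts and EMBOSS, and the function names and the short toy primers below are hypothetical, not the universal primers used for the actual models.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Turn a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    """Reverse complement that also handles IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def extract_region(sequence, forward_primer, reverse_primer):
    """Return the sequence between the two primer sites, or None if a primer is not found."""
    sequence = sequence.upper()
    fwd = re.search(primer_to_regex(forward_primer), sequence)
    if fwd is None:
        return None
    downstream = sequence[fwd.end():]
    # The reverse primer binds the opposite strand, so search for its reverse complement
    rev = re.search(primer_to_regex(reverse_complement(reverse_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


def primer_coverage(sequences, forward_primer, reverse_primer):
    """Fraction of sequences from which the region could be extracted."""
    if not sequences:
        return 0.0
    hits = sum(extract_region(s, forward_primer, reverse_primer) is not None for s in sequences)
    return hits / len(sequences)


# Toy example with hypothetical primers (not real 16S universal primers)
print(extract_region("TTACGGAAAACCCCGGTTCATT", "ACGG", "GAACY"))  # AAAACCCC
```

Under the rule stated on the slide, a region would be kept only when `primer_coverage(...)` is at least 0.5. This sketch requires exact primer matches; tolerating mismatches would need an approximate search.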
11. Table 1. Summary of the number of HVR sequences which were
used for the training and testing of RF*.
12. Parameter optimization
Labeled each sequence with its taxonomic information to the lowest
known level except species
Used V3 region for optimization of parameters
Calculated 2-mer, 3-mer, 4-mer, 5-mer, 6-mer nucleotide
frequencies and tried them as feature inputs
Tried various mtry values at each k to get the lowest out-of-bag (OOB)
error
Got the best results at k = 4, so 4-mer nucleotide frequencies were used
for building models at ntree = 1000.
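As an illustration of the k-mer feature input described above, the following Python sketch computes a 4-mer frequency vector for one sequence. The function name is hypothetical, and normalizing counts to frequencies that sum to 1 and skipping windows with ambiguous bases are assumptions; the slides do not state how the original models handled either point. Model training itself (Random Forest with tuned mtry and ntree = 1000) is not shown.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the frequencies of all 4**k k-mers, in a fixed lexicographic order."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # assumption: skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


features = kmer_frequencies("ACGTACGT")
print(len(features))  # 256 features for k = 4
```

Each sequence yields a vector of 256 values for k = 4, which would serve as one row of the feature matrix given to the classifier.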
17. Input data for testing
The first test dataset was obtained by randomly extracting ~10% of the
sequences that had been clustered out earlier using CD-HIT. 1%
random mutations were inserted in these sequences to mimic real-life
sequencing errors
The second dataset was obtained from real metagenomic sequences
available from the NCBI Sequence Read Archive (SRA)
Performance of 16S Classifier was compared with that of RDP
Classifier in terms of accuracy as well as time taken for computation.
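The 1% random mutation step used for the first test dataset could look like the Python sketch below. The slides do not say which mutation types were used, so this version assumes substitutions only (no insertions or deletions); the function name and the `seed` parameter are hypothetical.

```python
import random


def mutate(sequence, rate=0.01, seed=None):
    """Substitute each A/C/G/T base with a different base with probability `rate`."""
    rng = random.Random(seed)  # a seed makes the simulated errors reproducible
    bases = "ACGT"
    mutated = []
    for base in sequence.upper():
        if base in bases and rng.random() < rate:
            # Pick one of the three other bases so the position really changes
            mutated.append(rng.choice([b for b in bases if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


noisy = mutate("ACGT" * 250, rate=0.01, seed=42)
print(len(noisy))  # 1000, substitutions keep the length unchanged
```

Because only substitutions are made, the mutated sequence keeps its original length, so k-mer windows stay aligned with the original positions.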
21. Advantages of 16S Classifier
• Extremely fast
• High sensitivity as well as specificity
• Consistent across various HVRs
• Easy availability
• Easy to deploy and use
22. How to use
• Download the zip file for a particular hypervariable region or for the complete 16S gene,
which is freely available at
http://metagenomics.iiserb.ac.in/16Sclassifier/download.html
• Extract the zip file, which contains a model file (*.Rdata), a script file (*.sh) and
an exe file (16Sclassifier.exe).
• Other dependencies:
Install R from the following link: http://cran.r-project.org/
Install the randomForest R package
## Command line usage ##
./16sclassifier.exe <queryfile> <modelname>
The query file should be in FASTA format and the model name can be one of v2, v3, v4, v5,
v6, v7, v8, v23, v34, v35, v45, v56, v67, v78 or complete.
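Before running the tool, the two arguments can be checked with a small Python helper. This is a sketch and not part of the 16S Classifier package: the function names are hypothetical, the FASTA check is deliberately minimal, and the command is only assembled, not executed.

```python
# Model names listed in the usage instructions above
VALID_MODELS = [
    "v2", "v3", "v4", "v5", "v6", "v7", "v8",
    "v23", "v34", "v35", "v45", "v56", "v67", "v78", "complete",
]


def looks_like_fasta(text):
    """Minimal check: starts with a '>' header and every header has sequence lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    previous_was_header = False
    for line in lines:
        is_header = line.startswith(">")
        if is_header and previous_was_header:
            return False  # a header with no sequence after it
        previous_was_header = is_header
    return not previous_was_header


def build_command(query_file, model_name):
    """Return the argument list for the command shown above, after validating the model name."""
    if model_name not in VALID_MODELS:
        raise ValueError(f"unknown model name: {model_name!r}")
    return ["./16sclassifier.exe", query_file, model_name]


print(build_command("query.fasta", "v3"))
```

The returned list could be passed to `subprocess.run` from the directory that contains the extracted files.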
Editor's notes
Species diversity consists of:
1. Species richness, 2. Total number of species, and 3. Distribution of species
Plate counts are fast and cost-effective, but they cannot detect unculturable microbes and are biased towards fast-growing organisms and towards fungal species.
CLPP is fast, highly reproducible and inexpensive, and generates a large amount of data, but it represents only the culturable community and favours fast-growing organisms.
FAME: no culturing is needed and fatty acids are extracted directly from soil, but the results can be affected by external factors.