1. Studying the microbiome
Mick Watson
Head of Bioinformatics, Edinburgh Genomics, University of Edinburgh
Research Group Leader, The Roslin Institute, University of Edinburgh
2. Edinburgh Genomics
• Genomics facility based at the University of Edinburgh
• Available for collaborations on an academic, non-profit basis
• Formed from merger of
– ARK-Genomics
– The GenePool
• Funded by three major bio UK research councils
• A range of technologies and expertise available
http://genomics.ed.ac.uk
3. Prevailing theory of the individual
• An individual consists of at least 10x as many bacterial cells as “host” cells *
• Each individual is a “supra-organism”
– a composite of host and microbial cells that together contribute the functions necessary for the individual to survive
• The genetic landscape of any individual is a composite of the host genome
and the genomes of the millions of microbial symbionts that live on and
within that individual
• It is clearly important to take a holistic view when examining any animal
phenotype
My focus
• Move from discovery science to applied science
• From “What’s there?” to “What can we do with it?”
4. • The “ten times” figure comes from a paper in 1972, and is estimated from 1g of human faeces
• More modern estimates range from equal numbers to 100 times as many!
• American Society for Microbiology 2014 report puts the ratio closer to 3:1
• Panel included Peter Turnbaugh
• There’s still more of them though….
http://www.bostonglobe.com/ideas/2014/09/13/your-body-mostly-microbes-actually-have-idea/qlcoKot4wfUXecjeVaFKFN/story.html
5. Microbiome research is undergoing a crisis
Please don’t make things worse
• Crisis 1
– The correlation/causation fallacy. For example….
– Patients with type II diabetes have a different gut microbiome compared to healthy individuals
– Does the microbiome cause diabetes?
– Or do they have a different microbiome because they have diabetes?
(therefore different diet)
• Crisis 2
– A lot of people want to do it, but don’t know how
– Errors, bad experimental design, incorrect conclusions
6. What is the microbiome?
“the ecological community of commensal,
symbiotic, and pathogenic microorganisms that
literally share our body space”
- Joshua Lederberg
Note: includes fungi, protists, archaea, bacteria, algae, viruses etc etc etc
(whisper it: most “microbiome” studies only look at bacteria/archaea)
7. How do we study the microbiome?
• Marker gene vs shotgun metagenomics
• Marker gene
– 16S / 18S / ITS
– Amplify this and compare
• Metagenomics
– Extract all DNA
– Fragment, sequence, interpret
• In theory, the latter is the least biased*
8. 16S studies are not metagenomics
http://phylogenomics.blogspot.co.uk/2012/08/referring-to-16s-surveys-as.html, http://biomickwatson.wordpress.com/2014/01/12/youre-probably-not-doing-metagenomics/
9. 16S
• rRNA component of the prokaryotic small (30S) ribosomal subunit
• Present in all (?) bacterial/archaeal genomes, contains constant
and hypervariable regions
• Hypervariable regions may give “species specific” signatures
10. 16S process
• Current short-read sequencing technologies can’t sequence the whole gene (~1,500bp)
• Design primers in constant regions and PCR
• Amplify 1 or more hypervariable regions
• Cluster similar sequences into OTUs
• Compare to 16S database and assign phylogenetic group
• Compare abundance across sample groups (QIIME, Mothur)
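The clustering step above can be illustrated with a minimal sketch. It assumes sequences are already aligned to the same length and uses a 97% identity threshold; both are assumptions for illustration, not something the slides specify. The function names are hypothetical, and this is not how QIIME or Mothur implement clustering.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first OTU whose seed it matches at >= threshold.

    The 0.97 default is an assumed cut-off, not taken from the slides.
    """
    otus = []  # each OTU is a list of sequences; element 0 is the seed
    for seq in seqs:
        for otu in otus:
            if identity(seq, otu[0]) >= threshold:
                otu.append(seq)
                break
        else:
            otus.append([seq])  # no seed was close enough: start a new OTU
    return otus


# Toy data (hypothetical): b differs from a at 2 of 100 positions, c at 10 of 100.
a = "A" * 100
b = "A" * 98 + "CC"
c = "A" * 90 + "C" * 10
print(len(greedy_otus([a, b, c])))  # a and b share an OTU, c starts its own
```

Greedy clustering depends on input order, which is one reason different pipelines can report different OTU counts for the same reads.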
11. 16S problems
• Some genomes have multiple copies of the 16S gene
• The constant regions aren’t constant
– Design degenerate primers
– Some primers pick up certain groups better than others
– A perfect match primer will amplify better than one containing mis-matches
• The abundances from 16S are wrong, we simply hope that
they are consistently wrong across samples
• Absence is really difficult to prove, and wrong to assume
• Chimeras: PCR artefacts consisting of 16S gene fragments from two different molecules
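To make the multiple-copy problem concrete, here is a small sketch of how read counts could be adjusted for 16S copy number. The taxa, read counts and copy numbers are hypothetical, and the function name is invented for illustration. In practice copy numbers are unknown for many taxa, and this does nothing about primer bias.

```python
def copy_number_corrected(read_counts, copies_per_genome):
    """Return relative abundances after dividing reads by 16S copies per genome."""
    corrected = {
        taxon: reads / copies_per_genome[taxon]
        for taxon, reads in read_counts.items()
    }
    total = sum(corrected.values())
    if total == 0:
        raise ValueError("no reads to normalise")
    return {taxon: value / total for taxon, value in corrected.items()}


# Hypothetical example: equal numbers of cells, but taxon_A carries 7 copies
# of the 16S gene and taxon_B carries 1, so raw reads look 7:1.
reads = {"taxon_A": 700, "taxon_B": 100}
copies = {"taxon_A": 7, "taxon_B": 1}
print(copy_number_corrected(reads, copies))
# {'taxon_A': 0.5, 'taxon_B': 0.5}
```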
12. • Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. At least 1 in 20 16S rRNA sequence records
currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol.
2005 71(12):7724-36.
15. Sequencing: what’s on the market?
Technology | Advantages | Disadvantages | Output per run
Illumina | Highly accurate; cheap; industry leader; multiple platforms | Slower than Ion; short reads | HiSeq X Ten: 18Tb; HiSeq X: 1.8Tb; 2500 HO: 600Gb -> 1Tb; 2500 RO: 180Gb; NextSeq: 140Gb; MiSeq: 25Gb
Ion Torrent | Fast; cheap machine | Very poor on homopolymers; doesn’t match Illumina on throughput | PGM: 2Gb; Proton P1: 10Gb; Proton P2: 30Gb
PacBio | Long reads; single molecule | High error rate, needs correction; low throughput; expensive machine | 300-500Mb
Oxford Nanopore MinION | Long reads; single molecule; cheap; portable | High error rate; unknown quantity | Unknown
Complete Genomics | Highly accurate; cheap | Limited to human; black box | Unknown; human genomes can be purchased
17. 16S sequencing strategy?
• Platform: MiSeq
• Theoretically:
– 2x150bp can sequence ~280bp amplicon
– 2x250bp can sequence ~480bp amplicon
– 2x300bp can sequence ~580bp amplicon
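The arithmetic behind these figures can be sketched as follows. The 20bp minimum overlap is an assumption, chosen because it reproduces the 2x250bp and 2x300bp figures; the slides do not state it, and the function name is hypothetical.

```python
def max_amplicon_length(read_length, min_overlap=20):
    """Longest amplicon a read pair can span while still overlapping by min_overlap bases.

    min_overlap=20 is an assumed value, not one given in the slides.
    """
    if min_overlap < 0 or min_overlap > read_length:
        raise ValueError("min_overlap must be between 0 and read_length")
    return 2 * read_length - min_overlap


for n in (150, 250, 300):
    print(f"2x{n}bp -> ~{max_amplicon_length(n)}bp")
# 2x150bp -> ~280bp
# 2x250bp -> ~480bp
# 2x300bp -> ~580bp
```

These are theoretical limits: with so little overlap, the error-prone read ends are barely covered twice.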
18. Important paper
• Amongst other things, sequenced a mock community with different sequencing and bioinformatics strategies
• Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol. 2013 79(17):5112-20.
19. • Three 16S regions sequenced using 2x250bp
– V4 (~250 bp), V34 (~430 bp), and V45 (~375 bp) regions
– In the mock community, there should be 20 OTUs
20. 16S sequencing strategy?
• The only strategy that got close to the correct result was complete overlap of 2x250bp MiSeq reads (i.e. the ~250bp V4 amplicon)
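A quick sketch of why only one region gets complete overlap: for 2x250bp reads, the number of bases covered by both reads depends on the amplicon length. The region lengths are the approximate values from the previous slide; the function name is hypothetical and primer trimming is ignored.

```python
def read_overlap(amplicon_length, read_length=250):
    """Bases of the amplicon covered by both reads of a pair.

    Capped at the amplicon length (complete overlap) and floored at 0 (no overlap).
    """
    overlap = 2 * read_length - amplicon_length
    return max(0, min(overlap, amplicon_length))


# Approximate amplicon lengths from the slide on the three 16S regions.
regions = {"V4": 250, "V34": 430, "V45": 375}
for name, length in regions.items():
    overlap = read_overlap(length)
    print(f"{name}: {overlap}bp covered twice ({overlap / length:.0%} of amplicon)")
```

Only V4 has every base read twice, so a sequencing error in one read can be checked against the other read along the whole amplicon.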
22. Shotgun metagenomics
• Take ecosystem, extract all DNA and sequence it
• Should be unbiased, right?... Right?
• (NB: issues on the next few slides are also issues for
marker gene studies)
23. Extraction protocol
“we found that each DNA
extraction method resulted in
unique community patterns”
“We observed significant differences
in distribution of bacterial taxa
depending on the method.”
24. Storage
“Samples frozen with and without glycerol as cryoprotectant
indicated a major loss of Bacteroidetes in unprotected samples”
25. • In the chicken caecum, Bacteroidetes dominate, followed by Firmicutes:
• Nordentoft S et al (2011) The influence of the cage system and colonisation of Salmonella Enteritidis on
the microbial gut flora of laying hens studied by T-RFLP and 454 pyrosequencing. BMC Microbiol. 11:187
26. • In the chicken caecum, Firmicutes dominate, few Proteobacteria, no Bacteroidetes
• Danzeisen JL et al (2011). Modulations of the chicken cecal microbiome and metagenome in response to
anticoccidial and growth promoter treatment. PLOS ONE. 6(11):e27949.
27. • Did I mention that microbiome research is
undergoing a crisis?
• It gets worse…..
28. Contamination
• Sequenced a pure culture of
Salmonella bongori
• Extracted DNA using different kits
• Did serial dilutions of the pure
culture to assess impact of
contaminating species
30. The kits
• FastDNA Spin Kit For Soil (FP), MoBio UltraClean Microbial DNA Isolation Kit (MB), QIAamp DNA Stool Mini Kit (QIA) and PSP Spin Stool DNA Plus kit (PSP)
FP had a stable kit profile dominated by Burkholderia, PSP was dominated by
Bradyrhizobium, while the QIA kit had the most complex mix of bacterial DNA.
Bradyrhizobiaceae, Burkholderiaceae, Chitinophagaceae, Comamonadaceae,
Propionibacteriaceae and Pseudomonadaceae were present in at least three quarters of
the dilutions from PSP, FP and QIA kits. However, relative abundances of taxa at the
Family level varied according to kit: FP was marked by Burkholderiaceae and
Enterobacteriaceae, PSP was marked by Bradyrhizobiaceae and Chitinophagaceae. The
contamination in the QIA kit was relatively diverse in comparison to the other kits, and
included higher proportions of Aerococcaceae, Bacillaceae, Flavobacteriaceae,
Microbacteriaceae, Paenibacillaceae, Planctomycetaceae and Polyangiaceae than the
other kits. Kit MB did not have a distinct contaminant profile and varied from dilution to
dilution due to paucity of reads
31. “These metagenomic results therefore clearly
show that contamination becomes the dominant
feature of sequence data from low biomass
samples, and that the kit used to extract DNA can
have an impact on the observed bacterial
diversity”
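One way to look at a dilution series like this is to track the fraction of reads that do not belong to the organism you put in. The sketch below uses made-up read counts purely to show the calculation; they are not data from Salter et al, and the function name is hypothetical.

```python
def contaminant_fraction(genus_counts, expected_genus="Salmonella"):
    """Fraction of reads assigned to anything other than the expected genus."""
    total = sum(genus_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return 1 - genus_counts.get(expected_genus, 0) / total


# HYPOTHETICAL read counts per genus for a pure culture at three dilutions.
dilution_series = {
    "undiluted": {"Salmonella": 9900, "Bradyrhizobium": 60, "Burkholderia": 40},
    "10^-3": {"Salmonella": 6000, "Bradyrhizobium": 2500, "Burkholderia": 1500},
    "10^-5": {"Salmonella": 500, "Bradyrhizobium": 5500, "Burkholderia": 4000},
}
for dilution, counts in dilution_series.items():
    print(f"{dilution}: {contaminant_fraction(counts):.1%} of reads are not Salmonella")
```

For real samples the expected composition is unknown, which is why sequencing extraction blanks alongside the samples matters.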
32. From Salter et al:
“Tellingly, Laurence et al [1] recently
demonstrated with an in silico
analysis that Bradyrhizobium is a
common contaminant of
sequencing datasets including the
1000 Human Genome Project”
1. Laurence M, Hatzis C, Brash DE.
Common contaminants in next-generation
sequencing that hinder
discovery of low-abundance microbes.
PLoS One. 2014 9(5):e97876.
Adenoids are at the back of the nasal cavity
Bradyrhizobium is a soil bacterium
35. Shotgun metagenomics
• Can assemble
– MetaVelvet, Meta-IDBA, Ray Meta, MetAMOS
– Different techniques for partitioning
• Coverage, sequence composition, connectivity
• MetaWatt, CONCOCT
– Predict genes: Glimmer-MG, FragGeneScan
• Use reference
– Kraken, PhyloSift, MetaPhlAn, HUMAnN
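As an illustration of what “sequence composition” means for partitioning, here is a minimal sketch that turns a contig into a tetranucleotide frequency vector. It is not how MetaWatt or CONCOCT are implemented; the function name is hypothetical, and real tools typically also merge reverse-complement k-mers, which this sketch does not.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=4):
    """Return normalised k-mer frequencies over the ACGT alphabet, in a fixed order.

    k-mers containing other characters (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Toy contig (hypothetical); k=4 gives a 256-element vector.
profile = kmer_profile("ACGTACGTTTGACCA")
print(len(profile))
```

Contigs from the same genome tend to have similar profiles, so these vectors, combined with coverage, can be clustered to group contigs into bins.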
36. All-in-one solution
• EBI Metagenomics
• Hunter S, et al. EBI metagenomics--a new resource for the analysis and archiving of
metagenomic data. Nucleic Acids Res. 2014 42(Database issue):D600-6.
39. Conclusions
• I love microbiome research (honestly!)
• Really, incredibly exciting… but….
• Every step counts
• Be very careful, at all stages
• 16S – cheap, biased but effective
• WGS – expensive, information rich, less biased
• Beware contamination, include controls