So I have sequenced my
organism … what do I do now?
Nick Loman
Oh dear
Sequence some more
Sensible
Useful things
Whole-genome sequencing:
utility in clinical microbiology
• Diagnostics
– Species, subspecies, strain identification
– In silico antibiogram
– In silico virulence profile
• Surveillance
• Typing (including backwards compatibility with MLST and
serotype)
• What strains and resistance elements are lurking in my
hospital/community?
• Forensic epidemiology
– Is there an outbreak?
• Who gave what to who?
Common types of sequencing
• Paired-end Illumina (typically 150 – 300 bases)
• Single-end Ion Torrent (typically 300-400
bases)
– Can be treated more or less the same
• Pacific Biosciences or Oxford Nanopore
– Requires special handling, not covered today
Quality Control: Questions to Ask
• Did my sequencing work?
• What are the fragment lengths?
• Is my sample what I think it is?
• Is my sample contaminated?
QC workflow (step: tools)
• Read QC: FastQC, Qualimap, Kraken, BLAST
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology
Did my sequencing work?
• FastQC:
What coverage do I have?
• SNP calling: >10x (>15x better)
• De novo assembly: >30x (50x probably better)
• Absolutely no benefit above about 100x for
standard applications: it just slows everything
down and takes more disk space
• (BTW, FASTQ files are probably a waste of
space)
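A rough way to check these thresholds before running anything heavy: mean coverage is total sequenced bases divided by genome size. The sketch below is illustrative only; the function names and the example run (2,000,000 reads of 150 bases, 5 Mb genome) are assumptions, not figures from the talk.

```python
def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def coverage_verdict(depth: float) -> str:
    """Map a depth onto the rule-of-thumb thresholds listed above."""
    if depth > 100:
        return "more than needed: consider downsampling"
    if depth > 30:
        return "enough for de novo assembly and SNP calling"
    if depth > 10:
        return "enough for SNP calling only"
    return "too low"


# Hypothetical run: 2,000,000 reads of 150 bases on a 5 Mb genome.
depth = mean_coverage(2_000_000, 150, 5_000_000)
print(depth, coverage_verdict(depth))
```

For paired-end data, count both mates as reads.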
What are the fragment lengths?
• Qualimap (or just BWA)
• Bad: fragment length < read length
• OK: fragment length > read length
• Good: fragment length > 2x read length
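The three categories can be written as a small check. This is a sketch under assumptions: the insert sizes are a made-up list standing in for whatever your aligner or Qualimap reports, and treating "fragment length equal to read length" as OK is a choice made here, not something the slide specifies.

```python
from statistics import median


def classify_fragment_length(fragment_length: float, read_length: int) -> str:
    """Apply the bad / ok / good rule of thumb for fragment versus read length."""
    if fragment_length < read_length:
        return "bad"
    if fragment_length > 2 * read_length:
        return "good"
    return "ok"  # includes the boundary cases, which the slide leaves open


# Hypothetical insert sizes taken from a handful of aligned read pairs.
insert_sizes = [310, 280, 450, 390, 120, 330, 410]
typical = median(insert_sizes)
print(typical, classify_fragment_length(typical, read_length=150))
```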
You are in dangerous territory dealing with
repetitive regions longer than the fragment
length, regardless of read depth coverage
Repetitive regions
This is important because repeat-containing regions are often
the most interesting parts of the genome! Think:
• Insertion elements
• Transposons
• Plasmids
• Ribosomal RNA
REPEAT: You are in dangerous territory dealing
with repetitive regions longer than the fragment
length, regardless of read depth coverage
Do not trust the computer
Bioinformatics software will do its best to look
like it is dealing with repeats in a rational way,
but it is in fact plotting aggressively to ruin your
analysis without telling you.
Computers are just like that!
If repeats are important to your analysis, you need an
alternative sequencing strategy: long mate-pairs, long reads
(Pacific Biosciences or Oxford Nanopore). Don’t drive
yourself mad making short reads do what they can’t.
Adaptor trim reads
• With Nextera libraries, failing to adaptor trim
will KILL your assemblies.
• Particularly important when mean fragment
length < read length.
• Many trimmers available: I like to use
Trimmomatic
• Quality trimming is not important with modern
tools (BWA and SPAdes)
For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
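Why fragment length matters here: when the fragment is shorter than the read, the sequencer runs off the end of the insert and into the adaptor, so the read ends in adaptor sequence. The sketch below shows the idea with exact matching only. It is not how Trimmomatic works internally (real trimmers tolerate mismatches), and the adaptor sequence is an assumption you should check against your library kit.

```python
# Assumption: the commonly quoted Nextera transposase adaptor sequence.
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def trim_adapter(read: str, adapter: str = NEXTERA_ADAPTER, min_overlap: int = 5) -> str:
    """Remove the adaptor, and anything after it, from the 3' end of a read.

    Exact matching only. A full-length adaptor is searched for first; failing
    that, a prefix of the adaptor hanging off the 3' end of the read.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Partial adaptor at the read end: try the longest possible prefix first.
    max_k = min(len(adapter) - 1, len(read))
    for k in range(max_k, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


insert = "ACGTACGTAC"  # toy insert shorter than the read length
print(trim_adapter(insert + NEXTERA_ADAPTER + "GGG"))  # full adaptor read-through
print(trim_adapter(insert + NEXTERA_ADAPTER[:8]))      # partial adaptor at the end
```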
Is my sample what I think it is?
• BLASTing a few random reads usually very
efficient quality control check, as well as
helping identify a reference genome
• Kraken or Metaphlan can give rapid organism
report
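Pulling out "a few random reads" for BLAST can be done in one pass with reservoir sampling, so the whole FASTQ never has to sit in memory. A minimal sketch, assuming plain 4-line FASTQ records; the function names and the toy records are illustrative.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (read name, sequence)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) from 4-line FASTQ records; a truncated tail is dropped."""
    it = iter(lines)
    for header, seq, _plus, _qual in zip(it, it, it, it):
        yield header.strip()[1:], seq.strip()


def sample_reads(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Reservoir sampling: keep k records, each input record equally likely."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA, ready to paste into a BLAST search."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


demo = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n"
picked = sample_reads(parse_fastq(demo.splitlines()), k=2)
print(to_fasta(picked))
```

On a real file, pass an open file handle to `parse_fastq` instead of the demo string.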
Species identification
• Methods:
– 16S rDNA extraction (typically following de novo
assembly and annotation) and BLAST
– Taxon-defining genes (e.g. Metaphlan)
– Phylogenetic approach (e.g. MOCAT, Phylosift)
[Figure labels: isolate genome; sequence reads; other samples on sequencing run; contamination; unsequenced regions]
Sources of contamination
• Accidental multiple colony picks or mixed liquid
culture
– Same or different organism
– E.g. Achromobacter & Pseudomonas aeruginosa in CF
• Reagent contamination (DNA extractions)
• Sequencer “carry-over” (0.2%?)
• PhiX control sequence <- don’t be this guy
• Barcode “cross-over” (bad pipetting technique or
contaminated reagents)
Blobology
Contamination
Reference-based or de novo?
• Reference-based
– Implies ALIGNMENT to reference
– Implies you HAVE a reference
– Allows exquisitely sensitive and specific SNP calling
(forensic SNP calling to single mutation precision)
– Important for looking at CHAINS OF TRANSMISSION
– Can only call in parts of the genome COMMON
between your SAMPLES and REFERENCE: the CORE
Reference-based or de novo?
• De-novo
– Implies de novo assembly
– Does NOT require a reference
– Gives access to the entire PAN-genome
– E.g.
• Unexpected antibiotic resistance genes
• Virulence factors
– Can give misleading results in REPEAT sequences
– Not suitable for very fine-resolution SNP analysis
In practice
• Most people will want to do both.
• And if you have no reference, you can use a
draft de novo assembly AS your reference
– But exercise caution
Reference-based approach
Workflow (step: tools)
• Read QC: FastQC, Qualimap, Kraken, BLAST
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology
• Alignment: BWA
• Variant calling: Samtools/VarScan, GATK
• SNP extraction & filter: custom script, snippy, snpEff, BRESEQ
• Recombination filtering: Gubbins, ClonalFrameML
• Tree building: FastTree, RAxML
• MLST/Antibiogram: SRST2
Analysis choice highly species
dependent: not one size fits all!
• What is the mode and tempo of evolution?
• Monomorphic organisms:
– Characterised by vertical pattern of inheritance
– Isolates differ by few mutations
• Highly recombinogenic organisms
– Mutations dominated by recombination
– May have vast differences in gene content, gene
order
– “Clonal frame” may be obscured or absent
Different species require different
analysis strategies
[Figure: species placed along a "Variation" axis (M. tuberculosis, S. aureus, B. anthracis, E. coli, P. aeruginosa, N. meningitidis, S. pneumoniae, Salmonella), with three labelled regimes: clonal population structure / branching phylogenies; open pan-genome / horizontal gene transfer; high rates of recombination / phylogenetic networks]
Tips for picking a reference
• The higher quality the better (aim for pre-NGS
Sanger genomes, e.g. <2001)
• Ideally single contig, no gaps
• Canonical strains have most portable and
referenced gene references, e.g. TB H37Rv,
PAO1, E. coli K-12 etc.
• For SNP calling specificity: more closely
related is better
The core genome
• The core genome used to
call SNPs will reduce as
more genomes are added
• Particularly noticeable in
species with highly
plastic genomes: E. coli
• Has significance for
forensic applications
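The shrinking core is just a running intersection: a position stays in the core only if it can be called in every genome added so far. The toy sketch below makes that concrete; the sample names and position sets are invented for illustration, and real tools work from alignments rather than Python sets.

```python
from typing import Dict, List, Optional, Set


def core_sizes(callable_positions: Dict[str, Set[int]]) -> List[int]:
    """Core genome size after adding each sample in turn (running intersection)."""
    sizes: List[int] = []
    core: Optional[Set[int]] = None
    for positions in callable_positions.values():
        core = set(positions) if core is None else core & positions
        sizes.append(len(core))
    return sizes


# Hypothetical reference positions that could be called in each sample.
toy = {
    "sample_A": set(range(0, 100)),
    "sample_B": set(range(10, 100)),
    "sample_C": set(range(0, 80)),
}
print(core_sizes(toy))  # [100, 90, 70]
```

Each added genome can only keep the core the same size or make it smaller, which is why SNP distances computed before and after adding samples are not directly comparable.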
Is my reference good enough?
• Assess core genome size
– Harvest will do this for you
• Or look at samtools flagstat (?)
• Between-sample SNP calling efficiency goes
down with reference divergence
• Luxury option: get a Pacific Biosciences
complete reference done for each “clone” in
your dataset (for some definition of clone)
Effect of closer reference on P. aeruginosa genotyping

Reference          SNPs   Indels   Mapped
PAO1 reference     23     4        77%
PacBio reference   40     5        97%

Quick, Loman et al. BMJ Open 2014
SNP filtering
• Specific SNP dataset is vital for effective
phylogenetic reconstructions and outbreak
tracing
• Most SNP calling errors come from
– A) misalignment (sequence present in sample but not
in reference, so reads align to the wrong place)
– B) copy number variation (2 copies in sample, 1 copy
in reference)
• NOT from sequencing error (at least with
Illumina: systematic errors with other platforms)
SNP filtering (2)
• Allele frequency filter is most effective SNP filter
– AF > 0.9 (90%) works very well empirically
• Strand filter also very useful to prevent SNPs
around structural variations
• Filtering for low coverage not that helpful:
– 1/1000 error (Q30) at a minimum coverage of 3: (1/1000)^3 =
0.000000001 chance that every read is wrong at a position = < 1
error per genome
• Avoid SNPs at ends of contigs as these may be
mismapping
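The allele frequency and strand filters above can be sketched as follows. The 0.9 threshold is the one quoted on the slide; everything else is an assumption made for illustration: the `SnpCall` fields, and the choice to implement the strand filter as "the variant must be seen on both strands".

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    """Per-strand read counts at one candidate SNP (hypothetical structure)."""
    position: int
    ref_fwd: int
    ref_rev: int
    alt_fwd: int
    alt_rev: int

    @property
    def depth(self) -> int:
        return self.ref_fwd + self.ref_rev + self.alt_fwd + self.alt_rev

    @property
    def allele_frequency(self) -> float:
        return (self.alt_fwd + self.alt_rev) / self.depth if self.depth else 0.0


def passes_filters(call: SnpCall, min_af: float = 0.9, min_alt_per_strand: int = 1) -> bool:
    """Keep a SNP only if AF > min_af and the variant is supported on both strands."""
    if call.allele_frequency <= min_af:
        return False
    return call.alt_fwd >= min_alt_per_strand and call.alt_rev >= min_alt_per_strand


calls = [
    SnpCall(1000, 0, 1, 12, 11),   # AF about 0.96, both strands: keep
    SnpCall(2000, 10, 9, 11, 10),  # AF about 0.53, looks like a collapsed repeat: drop
    SnpCall(3000, 0, 0, 15, 0),    # AF 1.0 but forward strand only: drop
]
print([c.position for c in calls if passes_filters(c)])  # [1000]
```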
Detecting recombination
• Simple algorithms rely on SNP density, more
complex ones assess impact on “clonal
frame”
Normal SNP density Recombining region
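The "simple" SNP density idea can be sketched in a few lines: count SNPs in fixed windows along the reference and flag windows that are much denser than the rest. This is only the naive version; the window size and threshold are arbitrary assumptions, positions are taken as 0-based, and tools such as Gubbins or ClonalFrameML use far more careful models.

```python
from typing import Iterable, List, Tuple


def dense_windows(
    snp_positions: Iterable[int],
    genome_length: int,
    window: int = 1000,
    max_snps: int = 3,
) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) for non-overlapping windows holding more than max_snps SNPs."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in snp_positions:
        if not 0 <= pos < genome_length:
            raise ValueError(f"SNP position {pos} is outside the genome")
        counts[pos // window] += 1
    return [
        (i * window, min((i + 1) * window, genome_length), c)
        for i, c in enumerate(counts)
        if c > max_snps
    ]


# Hypothetical SNP positions: scattered singletons plus one dense cluster.
snps = [150, 4200, 4210, 4300, 4450, 4800, 9100]
print(dense_windows(snps, genome_length=10_000))  # [(4000, 5000, 5)]
```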
Impact of recombination filtering
De novo approach
• Interrogate the accessory genome
– Novel genes
• Some important applications take contigs
rather than reads as primary input
• SNP calling with de novo assembly is
fundamentally less reliable due to lack of
allele frequency information; but fine for
broad-scale clustering
Reference-based approach
Workflow (step: tools)
• Read QC: FastQC, Qualimap
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology, Kraken, BLAST
• Alignment: BWA
• Variant calling: Samtools/VarScan, GATK
• SNP extraction & filter: custom script, snippy
• Recombination filtering: Gubbins, ClonalFrameML
• Tree building: FastTree, RAxML
• MLST/Antibiogram: SRST2
De novo approach
Workflow (step: tools)
• Assembly: Velvet, SPAdes
• Annotation: Prokka
• MLST/Antibiogram: mlst, Abricate
• Tree building: Harvest
• Population genomics: BIGSdb, Phyloviz
• Pan-genome: LS-BSR
Concluding thoughts
1. Don’t trust your sequencing data (or others’)
– sense-check and validate each step
2. Make extensive use of visualisation tools to
do this
3. There’s more than one way to do any one
task
CLoud Infrastructure for Microbial
Bioinformatics (CLIMB)
• MRC funded project to
develop Cloud
Infrastructure for
microbial bioinformatics
• £4M of hardware, capable
of supporting >1000
individual virtual servers
• Amazon/Google cloud for
Academics
Meet-The-Expert
• Meet-The-Expert: Joao Carrico and I
• Tomorrow (Monday)
• 07:45 (really)
• Hall M
• Session ME11: What bioinformatics tools do I use for whole-genome
sequence (WGS)-based bacterial diagnostics and typing?
Acknowledgements
• Twitter comments:
– Tom Connor, Alan McNally, Torsten Seemann, C.
Titus Brown, Heng Li, Christoffer Flensburg, Matt
MacManes, Rachel Glover, Willem van Schaik, Bill
Hanage, Jennifer Gardy, Mick Watson, Esther
Robinson, Nicola Fawcett, Aziz Aboobaker, Ruth
Massey

The Sensory Organs, Anatomy and Function
 
Gas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptxGas-ExchangeS-in-Plants-and-Animals.pptx
Gas-ExchangeS-in-Plants-and-Animals.pptx
 
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
 
Harry Coumnas Thinks That Human Teleportation May Ensure Humanity's Survival
Harry Coumnas Thinks That Human Teleportation May Ensure Humanity's SurvivalHarry Coumnas Thinks That Human Teleportation May Ensure Humanity's Survival
Harry Coumnas Thinks That Human Teleportation May Ensure Humanity's Survival
 
Think Science: What Are Eclipses (101), by Craig Bobchin
Think Science: What Are Eclipses (101), by Craig BobchinThink Science: What Are Eclipses (101), by Craig Bobchin
Think Science: What Are Eclipses (101), by Craig Bobchin
 
Role of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptxRole of Gibberellins, mode of action and external applications.pptx
Role of Gibberellins, mode of action and external applications.pptx
 

ECCMID 2015 - So I have sequenced my genome ... what now?

  • 1. So I have sequenced my organism … what do I do now? Nick Loman
  • 7. Whole-genome sequencing: utility in clinical microbiology • Diagnostics – Species, subspecies, strain identification – In silico antibiogram – In silico virulence profile • Surveillance • Typing (including backwards compatibility with MLST and serotype) • What strains and resistance elements are lurking in my hospital/community? • Forensic epidemiology – Is there an outbreak? • Who gave what to who?
  • 8. Common types of sequencing • Paired-end Illumina (typically 150 – 300 bases) • Single-end Ion Torrent (typically 300-400 bases) – Can be treated more or less the same • Pacific Biosciences or Oxford Nanopore – Requires special handling, not covered today
  • 9. Quality Control: Questions to Ask • Did my sequencing work? • What are the fragment lengths? • Is my sample what I think it is? • Is my sample contaminated? Workflow and tools: Read QC (FastQC, Qualimap, Kraken, BLAST); Adaptor/quality trimming (Trimmomatic); Species ID (BLAST, Metaphlan, MOCAT); Sample QC (Blobology)
  • 10. Did my sequencing work? • FastQC:
  • 11. What coverage do I have? • SNP calling: >10x (>15x better) • De novo assembly: >30x (50x probably better) • Absolutely no benefits over about 100x for standard applications and slows everything down and takes more disk space • (BTW, FASTQ files are probably a waste of space)
  • 12. What are the fragment lengths? • Qualimap (or just BWA) • Bad: fragment length < read length • OK: fragment length > read length • Good: fragment length > 2x read length • You are in dangerous territory dealing with repetitive regions longer than the fragment length, regardless of read depth coverage
  • 13. Repetitive regions This is important because repeat-containing regions are often the most interesting parts of the genome! Think: • Insertion elements • Transposons • Plasmids • Ribosomal RNA REPEAT: You are in dangerous territory dealing with repetitive regions longer than the fragment length, regardless of read depth coverage
  • 14. Do not trust the computer Bioinformatics software will do its best to look like it is dealing with repeats in a rational way, but it is in fact plotting aggressively to ruin your analysis without telling you. Computers are just like that! If repeats are important to your analysis, you need an alternative sequencing strategy: long mate-pairs, long reads (Pacific Biosciences or Oxford Nanopore). Don’t drive yourself mad making short reads do what they can’t.
  • 15. Adaptor trim reads • With Nextera libraries, failing to adaptor trim will KILL your assemblies. • Particularly important when mean fragment length < read length. • Many trimmers available: I like to use Trimmomatic • Quality trimming not important with modern tools (BWA and Spades) For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
  • 16. Is my sample what I think it is? • BLASTing a few random reads is usually a very efficient quality control check, and also helps identify a reference genome • Kraken or Metaphlan can give a rapid organism report
  • 17. Species identification • Methods: – 16S rDNA extraction (typically following de novo assembly and annotation) and BLAST – Taxon-defining genes (e.g. Metaphlan) – Phylogenetic approach (e.g. MOCAT, Phylosift) For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
  • 18. [Figure labels: isolate genome; sequence reads; other samples on sequencing run; contamination; unsequenced regions]
  • 20. Sources of contamination • Accidental multiple colony picks or mixed liquid culture – Same or different organism – E.g. Achromobacter & Pseudomonas aeruginosa in CF • Reagent contamination (DNA extractions) • Sequencer “carry-over” (0.2%?) • PhiX control sequence <- don’t be this guy • Barcode “cross-over” (bad pipetting technique or contaminated reagents)
  • 23. Adaptor trim reads • With Nextera libraries, failing to adaptor trim will KILL your assemblies. • Particularly important when mean fragment length < read length. • Many trimmers available: I like to use Trimmomatic For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
  • 25. Reference-based or de novo? • Reference-based – Implies ALIGNMENT to reference – Implies you HAVE a reference – Allows exquisitely sensitive and specific SNP calling (forensic SNP calling to single mutation precision) – Important for looking at CHAINS OF TRANSMISSION – Can only call in parts of the genome COMMON between your SAMPLES and REFERENCE: the CORE
  • 26. Reference-based or de novo? • De-novo – Implies de novo assembly – Does NOT require a reference – Gives access to the entire PAN-genome – E.g. • Unexpected antibiotic resistance genes • Virulence factors – Can give misleading results in REPEAT sequences – Not suitable for very fine-resolution SNP analysis
  • 27. In practice • Most people will want to do both. • And if you have no reference, you can use a draft de novo assembly AS your reference – But exercise caution
  • 28. Reference-based approach • Read QC: FastQC, Qualimap, Kraken, BLAST • Adaptor/quality trimming: Trimmomatic • Species ID: BLAST, Metaphlan, MOCAT • Sample QC: Blobology • Alignment: BWA • Variant calling: Samtools/VarScan, GATK • SNP extraction & filter: custom script, snippy, snpEff, BRESEQ • Recombination filtering: Gubbins, ClonalFrameML • Tree building: FastTree, RAxML • MLST/Antibiogram: SRST2
  • 29. Analysis choice highly species dependent: not one size fits all! • What is the mode and tempo of evolution? • Monomorphic organisms: – Characterised by vertical pattern of inheritance – Isolates differ by few mutations • Highly recombinogenic organisms – Mutations dominated by recombination – May have vast differences in gene content, gene order – “Clonal frame” may be obscured or absent
  • 30. Different species require different analysis strategies [Figure: species placed along an axis of variation: M. tuberculosis, S. aureus, B. anthracis, E. coli, P. aeruginosa, N. meningitidis, S. pneumoniae, Salmonella. Figure labels: clonal population structure; branching phylogenies; open pan-genome; horizontal gene transfer; high rates of recombination; phylogenetic networks]
  • 31. Tips for picking a reference • The higher quality the better (aim for pre-NGS Sanger genomes, e.g. <2001) • Ideally single contig, no gaps • Canonical strains have most portable and referenced gene references, e.g. TB H37Rv, PAO1, E. coli K-12 etc. • For SNP calling specificity: more closely related is better
  • 32. The core genome • The core genome used to call SNPs will reduce as more genomes are added • Particularly noticeable in species with highly plastic genomes: E. coli • Has significance for forensic applications
  • 33. Is my reference good enough? • Assess core genome size – Harvest will do this for you • Or look at samtools flagstat (?) • Between-sample SNP calling efficiency goes down with reference divergence • Luxury option: get a Pacific Biosciences complete reference done for each “clone” in your dataset (for some definition of clone)
  • 34. Effect of closer reference on P. aeruginosa genotyping • PAO1 reference: 23 SNPs, 4 indels, 77% mapped • PacBio reference: 40 SNPs, 5 indels, 97% mapped Quick, Loman et al. BMJ Open 2014
  • 35. SNP filtering • Specific SNP dataset is vital for effective phylogenetic reconstructions and outbreak tracing • Most SNP calling errors come from – A) misalignment (sequence present in sample but not in reference, align) – B) copy number variation (2 copies in sample, 1 copy in reference) • NOT from sequencing error (at least with Illumina: systematic errors with other platforms)
  • 36. SNP filtering (2) • Allele frequency filter is the most effective SNP filter – AF > 0.9 (90%) works very well empirically • Strand filter also very useful to prevent SNPs around structural variations • Filtering for low coverage not that helpful: – 1/1000 error rate (Q30) at a minimum coverage of 3 = (1/1000)^3 = 0.000000001 chance of an error per position = < 1 error per genome • Avoid SNPs at ends of contigs as these may be mismapping
  • 37. Detecting recombination • Simple algorithms rely on SNP density; more complex ones assess impact on the “clonal frame” [Figure labels: normal SNP density; recombining region]
  • 39. De novo approach • Interrogate the accessory genome – Novel genes • Some important applications take contigs rather than reads as primary input • SNP calling with de novo assembly is fundamentally less reliable due to lack of allele frequency information; but fine for broad-scale clustering
  • 40. Reference-based approach • Read QC: FastQC, Qualimap • Adaptor/quality trimming: Trimmomatic • Species ID: BLAST, Metaphlan, MOCAT • Sample QC: Blobology, Kraken, BLAST • Alignment: BWA • Variant calling: Samtools/VarScan, GATK • SNP extraction & filter: custom script, snippy • Recombination filtering: Gubbins, ClonalFrameML • Tree building: FastTree, RAxML • MLST/Antibiogram: SRST2 De novo approach • Steps: Assembly, MLST/Antibiogram, Annotation, Tree building, Population genomics, Pan-genome • Tools: Velvet, SPADES, Prokka, Harvest, BigsDB, Phyloviz, LS-BSR, mlst, Abricate
  • 41. Concluding thoughts 1. Don’t trust your sequencing data (or others’) – sense-check and validate each step 2. Make extensive use of visualisation tools to do this 3. There’s more than one way to do any one task
  • 42. CLoud Infrastructure for Microbial Bioinformatics (CLIMB) • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics • £4M of hardware, capable of supporting >1000 individual virtual servers • Amazon/Google cloud for Academics
  • 43. Meet-The-Expert • Meet-The-Expert: Joao Carrico and I • Tomorrow (Monday) • 07:45 (really) • Hall M • Session ME11 What bioinformatics tools do I use for whole-genome sequence (WGS)-based bacterial diagnostics and typing?
  • 44. Acknowledgements • Twitter comments: – Tom Connor, Alan McNally, Torsten Seemann, C. Titus Brown, Heng Li, Christoffer Flensburg, Matt MacManes, Rachel Glover, Willem van Schaik, Bill Hanage, Jennifer Gardy, Mick Watson, Alan McNally, Esther Robinson, Nicola Fawcett, Aziz Aboobaker, Ruth Massey
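
The coverage rules of thumb on slide 11 and the low-coverage arithmetic on slide 36 can be sketched in a few lines of Python. This is an illustration only, not code from the talk: the function names, the 5 Mb genome size and the read counts are assumptions chosen for the example, and the error estimate follows the slide's simplification that read errors are independent.

```python
# Sketch only: names, the 5 Mb genome size and the read counts are
# illustrative assumptions, not values taken from the slides.


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average read depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def coverage_verdict(depth: float) -> str:
    """Map a depth onto the rule-of-thumb thresholds quoted on slide 11."""
    if depth > 100:
        return "above 100x: no extra benefit, consider downsampling"
    if depth > 30:
        return "enough for de novo assembly and SNP calling"
    if depth > 10:
        return "enough for SNP calling only"
    return "too low for SNP calling"


def chance_all_reads_wrong(phred_q: float, min_depth: int) -> float:
    """Probability that every read covering a position carries an error.

    A Phred score Q corresponds to a per-base error rate of 10^(-Q/10).
    Errors are treated as independent, as in the slide's estimate.
    """
    per_base_error = 10 ** (-phred_q / 10)
    return per_base_error ** min_depth


def expected_errors_per_genome(phred_q: float, min_depth: int, genome_size: int) -> float:
    """Expected number of positions where all covering reads are wrong."""
    return chance_all_reads_wrong(phred_q, min_depth) * genome_size


if __name__ == "__main__":
    depth = mean_coverage(n_reads=1_000_000, read_length=150, genome_size=5_000_000)
    print(depth, coverage_verdict(depth))
    print(chance_all_reads_wrong(phred_q=30, min_depth=3))
    print(expected_errors_per_genome(phred_q=30, min_depth=3, genome_size=5_000_000))
```

With Q30 bases and a minimum depth of 3, the expected number of such positions in a genome of a few megabases stays well below one, which is why the slide treats a low-coverage filter as a minor safeguard compared with the allele frequency filter.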

Editor's Notes

  1. Reminds me of an old joke: A man is travelling and stops an old man on the road and says “How do I get to xyz?”. The man pauses and has a good think about it. He asks “You want to get to xyz?”. He pauses again and concludes: “Well if I wanted to get to xyz, I wouldn’t have started from here.”
  2. Caution with filtering: several important antibiotic resistance mutations may occur in just several copies of a repetitive gene, e.g. 23S (linezolid resistance) - filtering will exclude these!
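
The caution in note 2 can be made concrete with a small Python sketch of the allele frequency filter from slide 36. The read counts and the "2 of 5 gene copies" scenario are hypothetical, and the sketch assumes reads from every copy of the repeated gene pile up on a single reference position, which is a simplification of how aligners treat repeats.

```python
# Sketch only: read counts and the multi-copy scenario are hypothetical.


def allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a position that support the variant allele."""
    if total_reads == 0:
        return 0.0
    return alt_reads / total_reads


def passes_af_filter(alt_reads: int, total_reads: int, min_af: float = 0.9) -> bool:
    """Keep a SNP only if its allele frequency exceeds the threshold (AF > 0.9)."""
    return allele_frequency(alt_reads, total_reads) > min_af


if __name__ == "__main__":
    # A fixed SNP in a single-copy gene: almost every read supports it.
    print(passes_af_filter(alt_reads=58, total_reads=60))

    # A resistance mutation present in 2 of 5 copies of a repeated gene
    # (for example 23S rRNA): only about 40% of the piled-up reads carry it,
    # so the same filter discards a clinically important variant.
    print(passes_af_filter(alt_reads=40, total_reads=100))
```

A filter tuned for phylogenetic SNPs is therefore not a safe way to screen for resistance mutations in multi-copy genes; those positions are better inspected separately, without the allele frequency cut-off.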