1. Rapid bacterial outbreak characterisation from whole genome sequencing
Torsten Seemann
Genome Science: Biology, Technology & Bioinformatics - Wed 13 July 2014 - Oxford, UK - #UKGS2014
2. About me
● Victorian Bioinformatics Consortium
o Monash University, Melbourne, Australia
● Microbial genomics
o bacterial pathogens; some parasites, viruses, fungi
● Tool development
o Prokka, Nesoni, VelvetOptimiser, Snippy, ...
3. Microbial Diagnostic Unit
● Oldest public health lab in Australia
o established 1897 in Melbourne
o large historical isolate collection back to 1950s
● National reference laboratory
o Salmonella, Listeria, EHEC
● WHO regional reference lab
o vaccine preventable invasive bacterial pathogens
4. New director
● Professor Ben Howden
o clinician, microbiologist, pathologist
o early adopter of genomics and bioinformatics
● Mandate
o modernise service delivery
o enhance research output and collaboration
o nationally lead the conversion to WGS
5. Outbreak scenario
● Receive samples (human, animal, enviro)
● Extract, culture, isolate
● Identification via phenotype, growth, media
● Typing: MLST, MLVA, PFGE, phage, sero, ...
● Screening: VITEK
● Report back to hospital, state government
6. Traditional typing
● Low resolution
o small subset of genome
MLST uses ~7 core genes
MLVA uses a handful of VNTR regions
o requires constant curation of new genotypes
● Labour intensive
o time consuming
7. Whole Genome Sequencing
● Backward compatible
o can derive most traditional genotypes
● High resolution
o all variation, plasmids, AbR & virulence genes
● High throughput
o cheap, fast - one assay replaces many
8. Resistance to change
● Protecting empires
o “this is how we’ve always done it”, job redundancies
● Expense of instruments
o capital purchase, new staff, maintenance
● Lack of bioinformatics support
o infrastructure, software, training
● Legal requirements
o must do PFGE, validation, accreditation
9. A vision for Australia
● A common online system for all labs
o upload samples
o automated standard analysis pipelines
● Access control
o each lab controls their own data
o jurisdictions can share data in national outbreaks
● Deploy on our national research cloud
o no investment or expertise needed
o can deploy private version if desired
10. Suggested pipeline
● Input
o FASTQ files for each isolate
● Per isolate output
o de novo assembly & annotation
o typing (species dependent)
o antibiotic resistance & virulence genes
● Per outbreak output
o annotated phylogenomic tree
o SNP distances, clonality predictions
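The "SNP distances" output can be illustrated with a minimal sketch. It assumes each isolate is already reduced to a string of core SNP alleles of equal length; the isolate names and alleles below are invented toy data, and `snp_distances` is a hypothetical helper, not part of any tool named in these slides.

```python
from itertools import combinations


def snp_distances(core_alignment):
    """Return pairwise SNP distances between isolates.

    core_alignment maps isolate name -> string of core SNP alleles,
    one character per core SNP site (all strings the same length).
    """
    lengths = {len(seq) for seq in core_alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all core SNP strings must be the same length")

    distances = {}
    for a, b in combinations(sorted(core_alignment), 2):
        # Hamming distance: number of sites where the two isolates differ
        distances[(a, b)] = sum(
            x != y for x, y in zip(core_alignment[a], core_alignment[b])
        )
    return distances


# Invented toy data: three isolates, four core SNP sites
toy = {"iso1": "ACGT", "iso2": "ACGA", "iso3": "TCGA"}
dist = snp_distances(toy)
for pair, d in sorted(dist.items()):
    print(pair, d)
```

Small pairwise distances are one possible input to a clonality prediction; any threshold would be species and outbreak dependent and is not specified here.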
11. Design goals
● Speed
o multi-threaded wherever possible
● Modular
o Unix-style reusable components
● Deployable on cloud
o Amazon, Nectar (.au), CLIMB (.uk)
● Open source
o Auditable, community contribution
12. Progress
● Currently
o assessing existing components
o implementing new ones - all on GitHub
● No final product yet
o but some components are usable now
● Rolling out in 2015
o labs around Australia will opt in, most are keen
13. Identifying isolates
● De novo assembly approach
o assemble into contigs
o BLAST contigs against all microbial sequences
o best hits, highest coverage
● Assembly free method
o build index of all microbial k-mers w/ taxonomy
o scan k-mers from reads and tally
o Kraken, BioBloomTools, ...
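The assembly-free idea can be sketched in a few lines. This is a deliberately simplified illustration, not how Kraken or BioBloomTools are implemented: k-mers shared between taxa are simply dropped (Kraken instead assigns them to a lowest common ancestor), reverse complements are ignored, and the reference sequences, taxon names and reads are invented toy data.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the one taxon it occurs in.

    k-mers seen in more than one taxon are discarded (a simplification).
    """
    index = {}
    shared = set()
    for taxon, seq in references.items():
        for kmer in set(kmers(seq, k)):
            if kmer in shared:
                continue
            if kmer in index and index[kmer] != taxon:
                del index[kmer]
                shared.add(kmer)
            else:
                index[kmer] = taxon
    return index


def tally_reads(reads, index, k):
    """Scan the k-mers of every read and count hits per taxon."""
    tally = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            taxon = index.get(kmer)
            if taxon is not None:
                tally[taxon] += 1
    return tally


# Invented toy data
references = {"taxonA": "ACGTACGTGG", "taxonB": "TTGACCATGA"}
reads = ["ACGTACG", "GTACGTG"]
index = build_index(references, k=4)
tally = tally_reads(reads, index, k=4)
print(tally.most_common())
```

The taxon with the highest tally is the candidate identification. Real tools use much longer k-mers and compact index structures to cover all microbial sequences.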
15. Assembill
● Decent automated assemblies
o only 3 parameters: outdir + R1.fq.gz + R2.fq.gz
o supports multithreading at all steps
● Main steps
o adaptor removal & quality trimming (Skewer)
o selection of K from k-mer spectra (KmerGenie)
o de novo assembly (Velvet, SPAdes)
o ordering of contigs against reference (MUMmer)
16. Prokka
● Prokaryotic Annotation
o only 2 parameters: outdir + contigs.fa
o scales to about 32 threads
● Finds
o CDS, tRNA, tmRNA, rRNA, some ncRNA
o CRISPR, signal peptides
● Produces
o GenBank, GFF3, Sequin, FASTA, ...
17. mlst
● Multi-Locus Sequence Typing
o only 2 parameters: scheme + contigs.fa
● Can mass-screen hundreds of assemblies
o comes bundled with PubMLST database
● Output
o tab/comma separated values
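The final step of sequence typing, turning an allele profile into a sequence type (ST), amounts to a table lookup. The sketch below assumes the allele numbers have already been found in the contigs; the scheme, locus names, profiles and output layout are all invented for illustration and do not come from PubMLST or from the mlst tool.

```python
def assign_st(alleles, loci, profiles):
    """Look up a sequence type from allele numbers.

    alleles  : dict locus -> allele number (locus missing if not found)
    loci     : ordered list of locus names in the scheme
    profiles : dict mapping a tuple of allele numbers -> ST
    Returns the ST, or None for an incomplete or novel profile.
    """
    profile = tuple(alleles.get(locus) for locus in loci)
    if None in profile:
        return None
    return profiles.get(profile)


def tsv_row(sample, scheme, st, alleles, loci):
    """Format one result as a tab-separated line (hypothetical layout)."""
    fields = [sample, scheme, "-" if st is None else str(st)]
    for locus in loci:
        allele = alleles.get(locus)
        fields.append(f"{locus}({'-' if allele is None else allele})")
    return "\t".join(fields)


# Invented toy scheme with three loci
LOCI = ["geneA", "geneB", "geneC"]
PROFILES = {(1, 1, 1): 1, (1, 2, 1): 2}

known = {"geneA": 1, "geneB": 2, "geneC": 1}
novel = {"geneA": 1, "geneB": 3, "geneC": 1}
partial = {"geneA": 1, "geneB": 2}

for name, alleles in [("iso1.fa", known), ("iso2.fa", novel), ("iso3.fa", partial)]:
    st = assign_st(alleles, LOCI, PROFILES)
    print(tsv_row(name, "toy_scheme", st, alleles, LOCI))
```

Because each isolate reduces to one line, screening hundreds of assemblies yields a table that is easy to filter and sort.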
18. AbRicate
● Identify known antibiotic resistance genes
o only 1 parameter: contigs.fa
● Only as good as the underlying database
o Bundled with ResFinder
o does not detect SNP-based resistance (point mutations in existing genes)
● Output
o tab/comma separated table
19. Wombac
● Quickly identify core genome SNPs
● Efficiently use all CPUs and RAM
● Re-use previous reference alignments
● Cheap to calculate new core subsets
20. Read alignment
Use BWA MEM
● Do not need to clip reads
● Deduces the fragment library attributes
● Marks multi-mapping reads properly
● Scales linearly to >100 cores
● Outputs SAM directly
21. Sorted BAM
● No intermediate files
o use Unix pipes
● Multiple CPUs with SAMtools 0.1.19+
o use the -@ command line parameter
bwa → samtools view → samtools sort → BAM
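One way to wire this flow together without intermediate files is to build a single shell pipeline. The sketch below only constructs the command string; it assumes the samtools 1.x command line (`samtools sort -o out.bam -`), whereas 0.1.19 took an output prefix instead. The function name, file names and thread count are illustrative assumptions.

```python
import shlex


def sorted_bam_command(ref, r1, r2, out_bam, threads=8):
    """Build a bwa mem | samtools view | samtools sort pipeline string.

    Nothing is written between the steps; data flows through Unix pipes.
    Assumes samtools 1.x option syntax.
    """
    bwa = ["bwa", "mem", "-t", str(threads), ref, r1, r2]
    # -u: uncompressed BAM, cheaper to pass down the pipe
    view = ["samtools", "view", "-u", "-"]
    # -@: extra sorting threads, -o: output file, "-": read from stdin
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in cmd) for cmd in (bwa, view, sort)
    )


cmd = sorted_bam_command("ref.fa", "R1.fq.gz", "R2.fq.gz", "isolate.bam", threads=4)
print(cmd)
```

With bwa and samtools installed and the reference indexed, the string could be run with `subprocess.run(cmd, shell=True, check=True)`.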
22. SNP calling
● FreeBayes
o set in haploid mode (-p 1)
o set regular parameters (mindepth, minfrac)
o call variants in all samples jointly (more power)
o single multi-isolate VCF output
freebayes -f ref.fa -p 1 *.bam > all.vcf
23. Parallel FreeBayes
● FreeBayes is single threaded
o divide genome into regions
o run separate freebayes in parallel on each region
o merge the results
o scales nearly linearly!
fasta_generate_regions.py ref.fa.fai 100000 > regions.txt
freebayes-parallel regions.txt 32 -f ref.fa -p 1 *.bam > all.vcf
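The "divide genome into regions" step can be sketched as follows. This is an illustrative stand-in, not the script shipped with FreeBayes: the `name:start-end` strings with 0-based, half-open coordinates are an assumption, and the contig names and lengths are invented.

```python
def generate_regions(contig_lengths, region_size):
    """Split each contig into consecutive windows of at most region_size bases.

    contig_lengths maps contig name -> length in bases.
    Returns strings of the form "name:start-end" (0-based start, exclusive end).
    """
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, region_size):
            end = min(start + region_size, length)
            regions.append(f"{name}:{start}-{end}")
    return regions


# Invented toy reference: two contigs
regions = generate_regions({"contig1": 250, "contig2": 100}, region_size=100)
for region in regions:
    print(region)
```

Each region can be handed to a separate variant-calling process, and the per-region outputs are then concatenated in genome order. Since the regions do not overlap, the work divides evenly as long as the windows are small relative to the genome.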
24. Select core SNPs
● Core SNPs
o position present in every isolate
o more than one allele (not wholly conserved)
o usually ignore indels and other odd genotypes
● Recombination
o not all core SNPs are real
o many are the result of recombination
o should be filtered out, could alter tree topology
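The core SNP definition above can be expressed as a short filter. The sketch assumes each isolate has a dictionary of position to called allele, with uncovered or uncallable positions simply absent; `core_snp_sites` and the data are hypothetical, and recombination filtering is not attempted here.

```python
def core_snp_sites(calls, isolates=None):
    """Return sorted positions that are core SNPs for the chosen isolates.

    calls    : dict isolate -> dict position -> allele string
               (a position is absent if that isolate has no call there)
    isolates : optional subset of isolate names; defaults to all
    """
    if isolates is None:
        isolates = sorted(calls)
    if not isolates:
        return []

    # Core: the position must be present in every isolate
    shared = set(calls[isolates[0]])
    for iso in isolates[1:]:
        shared &= set(calls[iso])

    core = []
    for pos in sorted(shared):
        alleles = {calls[iso][pos] for iso in isolates}
        # Ignore indels and other odd genotypes: single bases only
        if not all(a in ("A", "C", "G", "T") for a in alleles):
            continue
        # SNP: more than one allele, so not wholly conserved
        if len(alleles) > 1:
            core.append(pos)
    return core


# Invented toy calls for three isolates
calls = {
    "iso1": {10: "A", 20: "C", 30: "G", 40: "T"},
    "iso2": {10: "A", 20: "T", 30: "G"},
    "iso3": {10: "A", 20: "C", 30: "GA", 40: "C"},
}
print(core_snp_sites(calls))
print(core_snp_sites(calls, isolates=["iso1", "iso3"]))
```

Passing a subset of isolates recomputes the core from the existing calls without re-aligning any reads. In this toy data, position 40 only becomes core once iso2, which lacks a call there, is left out.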
25. Wombac speed
● Example
o 130 E. coli isolates, MiSeq 300bp PE
o With 32 cores, used < 4GB RAM/core
o Took just over 1 hour
● Add a new sample
o Re-use existing alignments
o Will migrate to gVCF method that GATK will use
● Recalculate a core tree on subset