Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014

Rapid bacterial outbreak characterisation from whole genome sequencing
Torsten Seemann
Genome Science: Biology, Technology & Bioinformatics - Wed 3 Sep 2014 - Oxford, UK - #UKGS2014

About me
● Victorian Bioinformatics Consortium
o Monash University, Melbourne, Australia
● Microbial genomics
o bacterial pathogens; some parasites, viruses, fungi
● Tool development
o Prokka, Nesoni, VelvetOptimiser, Snippy, ...

Microbial Diagnostic Unit
● Oldest public health lab in Australia
o established 1897 in Melbourne
o large historical isolate collection back to 1950s
● National reference laboratory
o Salmonella, Listeria, EHEC
● WHO regional reference lab
o vaccine preventable invasive bacterial pathogens

New director
● Professor Ben Howden
o clinician, microbiologist, pathologist
o early adopter of genomics and bioinformatics
● Mandate
o modernise service delivery
o enhance research output and collaboration
o nationally lead the conversion to WGS

Outbreak scenario
● Receive samples (human, animal, enviro)
● Extract, culture, isolate
● Identification via phenotype, growth, media
● Typing: MLST, MLVA, PFGE, phage, sero, ...
● Screening: VITEK
● Report back to hospital, state government

Traditional typing
● Low resolution
o small subset of genome: MLST uses ~7 core genes, MLVA a handful of VNTR regions
o requires constant curation of new genotypes
● Labour intensive
o time consuming

Whole Genome Sequencing
● Backward compatible
o can derive most traditional genotypes
● High resolution
o all variation, plasmids, AbR & virulence genes
● High throughput
o cheap, fast - one assay replaces many

Resistance to change
● Protecting empires
o “this is how we’ve always done it”, job redundancies
● Expense of instruments
o capital purchase, new staff, maintenance
● Lack of bioinformatics support
o infrastructure, software, training
● Legal requirements
o must do PFGE, validation, accreditation

A vision for Australia
● A common online system for all labs
o upload samples
o automated standard analysis pipelines
● Access control
o each lab controls their own data
o jurisdictions can share data in national outbreaks
● Deploy on our national research cloud
o no investment or expertise needed
o can deploy private version if desired

Suggested pipeline
● Input
o FASTQ files for each isolate
● Per isolate output
o de novo assembly & annotation
o typing (species dependent)
o antibiotic resistance & virulence genes
● Per outbreak output
o annotated phylogenomic tree
o SNP distances, clonality predictions

Design goals
● Speed
o multi-threaded wherever possible
● Modular
o Unix-style reusable components
● Deployable on cloud
o Amazon, Nectar (.au), CLIMB (.uk)
● Open source
o Auditable, community contribution

Progress
● Currently
o assessing existing components
o implementing new ones - all on GitHub
● No final product yet
o but some components are usable now
● Rolling out in 2015
o labs around Australia will opt in, most are keen

Identifying isolates
● De novo assembly approach
o assemble into contigs
o BLAST contigs against all microbial sequences
o best hits, highest coverage
● Assembly free method
o build index of all microbial k-mers w/ taxonomy
o scan k-mers from reads and tally
o Kraken, BioBloomTools, ...
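
The assembly-free idea above can be sketched in a few lines of Python. This is a toy illustration only, not how Kraken or BioBloomTools are implemented: the function names are hypothetical, the index is a plain dictionary, and a k-mer shared between taxa is simply marked ambiguous, whereas Kraken assigns it to the lowest common ancestor in the taxonomy.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of length k from seq (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the taxon it came from.

    references: {taxon_label: sequence}
    A k-mer seen in more than one taxon is stored as None (ambiguous).
    """
    index = {}
    for taxon, seq in references.items():
        for kmer in set(kmers(seq, k)):
            if kmer in index and index[kmer] != taxon:
                index[kmer] = None
            else:
                index[kmer] = taxon
    return index


def classify_reads(reads, index, k):
    """Tally reads per taxon by majority vote over each read's k-mers."""
    tally = Counter()
    for read in reads:
        votes = Counter(
            index[km] for km in kmers(read, k) if index.get(km) is not None
        )
        label = votes.most_common(1)[0][0] if votes else "unclassified"
        tally[label] += 1
    return tally
```

Calling `classify_reads(reads, build_index(references, k), k)` returns a per-taxon read count, which is the kind of tally a report like Kraken's summarises.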

Kraken report
  1.04   1046   1046  U        0  unclassified
 98.96  99624    142  -        1  root
 98.81  99473      1  -   131567    cellular organisms
 98.81  99472    194  D        2      Bacteria
 98.57  99233    111  P     1224        Proteobacteria
 98.45  99110    318  C     1236          Gammaproteobacteria
 98.07  98728      0  O    91347            Enterobacteriales
 98.07  98728  52477  F      543              Enterobacteriaceae
 44.95  45256    665  G      561                Escherichia
 44.20  44498  33391  S      562                  Escherichia coli
  8.84   8899   8899  -  1274814                    Escherichia coli APEC O78
  0.29    287      0  -   244319                    Escherichia coli O26:H11
  0.29    287    287  -   573235                      Escherichia coli O26:H11 str 11368
  0.21    216    216  -   316401                    Escherichia coli ETEC H10407
  0.19    193      0  -   168807                    Escherichia coli O127:H6
  0.19    193    193  -   574521                      Escherichia coli O127:H6 str E2348/69
http://ccb.jhu.edu/software/kraken
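
A report like the one above has six columns: percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code, NCBI taxonomy ID, and name. A minimal Python sketch for pulling out the best species-level hit might look like this; `KrakenRow`, `parse_report` and `top_hit` are hypothetical helper names, not part of Kraken.

```python
from typing import NamedTuple


class KrakenRow(NamedTuple):
    percent: float      # % of reads covered by the clade rooted here
    clade_reads: int    # reads covered by the clade
    direct_reads: int   # reads assigned directly to this taxon
    rank: str           # U, D, P, C, O, F, G, S or "-"
    taxid: int          # NCBI taxonomy ID
    name: str           # scientific name


def parse_report(text):
    """Parse whitespace- or tab-separated report lines; skip anything else."""
    rows = []
    for line in text.splitlines():
        parts = line.strip().split(None, 5)
        if len(parts) != 6:
            continue
        pct, clade, direct, rank, taxid, name = parts
        try:
            rows.append(
                KrakenRow(float(pct), int(clade), int(direct), rank, int(taxid), name)
            )
        except ValueError:
            continue  # not a data line
    return rows


def top_hit(rows, rank="S"):
    """Return the row with the most clade reads at the given rank, or None."""
    candidates = [r for r in rows if r.rank == rank]
    return max(candidates, key=lambda r: r.clade_reads) if candidates else None
```

On the report shown, `top_hit(parse_report(report_text))` would pick the Escherichia coli row (taxid 562). Note that more than half of the reads stop at the family level, so the species percentage alone understates confidence.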

Assembill
● Decent automated assemblies
o only 3 parameters: outdir + R1.fq.gz + R2.fq.gz
o supports multithreading at all steps
● Main steps
o adaptor removal & quality trimming (Skewer)
o selection of K from k-mer spectra (KmerGenie)
o de novo assembly (Velvet, SPAdes)
o ordering of contigs against reference (MUMmer)

Prokka
● Prokaryotic Annotation
o only 2 parameters: outdir + contigs.fa
o scales to about 32 threads
● Finds
o CDS, tRNA, tmRNA, rRNA, some ncRNA
o CRISPR, signal peptides
● Produces
o GenBank, GFF3, Sequin, FASTA, ...

mlst
● Multi-Locus Sequence Typing
o only 2 parameters: scheme + contigs.fa
● Can mass-screen hundreds of assemblies
o comes bundled with PubMLST database
● Output
o tab/comma separated values
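
To make the typing step concrete, here is a toy Python sketch of the MLST idea: find which known allele of each locus is present in the assembly, then look up the resulting allele profile in a scheme table. It is not the mlst tool's implementation; the function names and data layout are hypothetical, and matching is by exact forward-strand substring only, which a real tool would replace with a proper sequence search on both strands.

```python
def call_alleles(contigs, alleles):
    """Return {locus: allele_number or None} for an assembly.

    contigs: list of contig sequences
    alleles: {locus: {allele_number: allele_sequence}}
    """
    profile = {}
    for locus, variants in alleles.items():
        profile[locus] = None
        for number, seq in variants.items():
            if any(seq in contig for contig in contigs):
                profile[locus] = number
                break
    return profile


def lookup_st(profile, scheme):
    """Return the sequence type whose profile matches exactly, else None.

    scheme: {sequence_type: {locus: allele_number}}
    """
    for st, st_profile in scheme.items():
        if st_profile == profile:
            return st
    return None
```

A profile with no matching sequence type returns `None`, which is the case that needs curation as a new genotype.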

AbRicate
● Identify known AB resistance genes
o only 1 parameter: contigs.fa
● Only as good as the underlying database
o Bundled with ResFinder
o does not cover resistance conferred by SNPs (point mutations)
● Output
o tab/comma separated table

Wombac
● Quickly identify core genome SNPs
● Efficiently use all CPUs and RAM
● Re-use previous reference alignments
● Cheap to calculate new core subsets

Read alignment
Use BWA MEM
● Do not need to clip reads
● Deduces the fragment library attributes
● Marks multi-mapping reads properly
● Scales linearly to >100 cores
● Outputs SAM directly

Sorted BAM
● No intermediate files
o use Unix pipes
● Multiple CPUs with SAMtools 0.1.19 or later
o use the -@ command line parameter
bwa → samtools view → samtools sort → BAM
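
The pipe chain above could be driven from Python roughly as follows. This is a sketch under assumptions: `build_commands` and `run_pipeline` are hypothetical helpers, and the `samtools sort` call uses the 0.1.19-era "input, output prefix" form; newer SAMtools releases expect `-o out.bam` instead.

```python
import subprocess


def build_commands(ref, r1, r2, out_prefix, threads=4):
    """Return the three commands of the alignment pipe as argument lists."""
    t = str(threads)
    bwa = ["bwa", "mem", "-t", t, ref, r1, r2]           # SAM on stdout
    view = ["samtools", "view", "-@", t, "-u", "-S", "-"]  # SAM -> uncompressed BAM
    sort = ["samtools", "sort", "-@", t, "-", out_prefix]  # writes out_prefix.bam
    return [bwa, view, sort]


def run_pipeline(commands):
    """Chain commands with Unix pipes (no intermediate files); return exit codes."""
    procs = []
    prev_stdout = None
    for i, cmd in enumerate(commands):
        last = i == len(commands) - 1
        proc = subprocess.Popen(
            cmd,
            stdin=prev_stdout,
            stdout=None if last else subprocess.PIPE,
        )
        if prev_stdout is not None:
            prev_stdout.close()  # parent drops its copy so upstream sees a closed pipe
        prev_stdout = proc.stdout
        procs.append(proc)
    return [p.wait() for p in procs]
```

`run_pipeline(build_commands("ref.fa", "R1.fq.gz", "R2.fq.gz", "isolate1"))` would start all three processes at once and stream data between them.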

SNP calling
● FreeBayes
o set in haploid mode (p=1)
o set regular parameters (mindepth, minfrac)
o call variants in all samples jointly (more power)
o single multi-isolate VCF output
freebayes -f ref.fa -p 1 *.bam > all.vcf

Parallel Freebayes
● FreeBayes is single threaded
o divide genome into regions
o run separate freebayes in parallel on each region
o merge the results
o scales nearly linearly!
fasta_generate_regions.py ref.fa.fai 100000 > regions.txt
freebayes-parallel regions.txt 32 -f ref.fa -p 1 *.bam > all.vcf
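
The "divide genome into regions" step is simple enough to sketch in Python. The function below is a hypothetical stand-in for the region-generating script, assuming regions are written as `contig:start-end` with a 0-based start; check the coordinate convention against the FreeBayes documentation before relying on it.

```python
def generate_regions(contig_lengths, region_size):
    """Yield 'contig:start-end' strings tiling each contig in fixed-size chunks.

    contig_lengths: {contig_name: length_in_bp}
    region_size:    maximum chunk length in bp
    """
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    for name, length in contig_lengths.items():
        start = 0
        while start < length:
            end = min(start + region_size, length)
            yield f"{name}:{start}-{end}"
            start = end
```

Each region can then be handed to a separate single-threaded caller, and the per-region results concatenated in order. Smaller regions balance load better across cores at the cost of more start-up overhead.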

Select core SNPs
● Core SNPs
o position present in every isolate
o more than one allele (not wholly conserved)
o usually ignore indels and other odd genotypes
● Recombination
o not all core SNPs are real
o many result of recombination
o should be filtered out, could alter tree topology
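
The core SNP definition above translates directly into a column filter. This is a minimal Python sketch, assuming each isolate has already been reduced to one sequence in reference coordinates with `-` or `N` wherever no confident call was made; the function names are hypothetical, and recombination filtering is not attempted.

```python
def core_snp_columns(alignment):
    """Return positions that are called in every isolate and not wholly conserved.

    alignment: {isolate_name: sequence}, all sequences the same length.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all sequences must have the same length")
    core = []
    for pos in range(length):
        column = [alignment[n][pos].upper() for n in names]
        present_everywhere = all(base in "ACGT" for base in column)
        if present_everywhere and len(set(column)) > 1:
            core.append(pos)
    return core


def snp_distance(seq_a, seq_b, columns):
    """Count core SNP columns at which two isolates differ."""
    return sum(1 for p in columns if seq_a[p].upper() != seq_b[p].upper())
```

Because the filter only looks at columns, re-running it on a subset of isolates gives that subset's own core without re-aligning any reads, and `snp_distance` over those columns gives the pairwise SNP distances.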

Wombac speed
● Example
o 130 E. coli isolates, MiSeq 300bp PE
o With 32 cores, used < 4GB RAM/core
o Took just over 1 hour
● Add a new sample
o Re-use existing alignments
o Will migrate to gVCF method that GATK will use
● Recalculate a core tree on subset

Contact
● Email: torsten.seemann@gmail.com
● Twitter: @torstenseemann
● Blog: TheGenomeFactory.blogspot.com
● Web: bioinformatics.net.au
