Variation Graphs and Structural Variation
1. vg: the variation graph toolkit
Eric Dawson
October 2016
@erictdawson
2. Variation graphs
"Variation graphs provide a succinct encoding of the sequences of many genomes.
A variation graph (in particular as implemented in vg) is composed of:
• nodes, which are labeled by sequences and ids
• edges, which connect two nodes via either of their respective ends
• paths, which describe genomes, sequence alignments, and annotations (such as gene
models and transcripts) as walks through nodes connected by edges"*
*From the vg wiki
Variation graphs allow us to map directly against known (or
proposed) variation.
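The node / edge / path definition above can be sketched as a small data structure. This is a toy illustration, not vg's implementation: the class name `VariationGraph`, its methods, and the example sequences are all hypothetical, and it simplifies by treating every edge as running from the end of one node to the start of the next (vg edges may connect either end of a node).

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph: labeled nodes, directed edges, named paths."""
    nodes: dict = field(default_factory=dict)   # node id -> sequence
    edges: set = field(default_factory=set)     # (from_id, to_id) pairs
    paths: dict = field(default_factory=dict)   # path name -> list of node ids

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, from_id, to_id):
        self.edges.add((from_id, to_id))

    def add_path(self, name, node_ids):
        # A path must be a walk: every consecutive pair of nodes needs an edge.
        for a, b in zip(node_ids, node_ids[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge {a}->{b} for path {name!r}")
        self.paths[name] = list(node_ids)

    def path_sequence(self, name):
        # Spell out the sequence a path describes by concatenating node labels.
        return "".join(self.nodes[n] for n in self.paths[name])


def build_snp_graph():
    # Hypothetical example: one A/G SNP flanked by shared sequence.
    g = VariationGraph()
    g.add_node(1, "GATT")
    g.add_node(2, "A")
    g.add_node(3, "G")
    g.add_node(4, "CA")
    for edge in [(1, 2), (1, 3), (2, 4), (3, 4)]:
        g.add_edge(*edge)
    g.add_path("ref", [1, 2, 4])
    g.add_path("alt", [1, 3, 4])
    return g
```

Both haplotypes share nodes 1 and 4, so the graph stores the flanking sequence once and the variant site as two alternative nodes.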
4. Variation graphs are pangenomes
“Computational Pan-Genomics: Status,
Promises and Challenges.”
Computational Pan-Genomics Consortium.
Briefings in Bioinformatics (2016) in press
Pangenomes should fulfill a
number of basic functions.
Erik Garrison, A toolkit for practical pangenomics, ECCB 2016
5. Variation graphs are pangenomes
We've implemented most
of these operations in vg.
github.com/vgteam/vg
Erik Garrison, A toolkit for practical pangenomics, ECCB 2016
6. Constructing graphs
construct - build graphs from paired FASTA/VCF files.
msga - use progressive assembly to generate a graph from input
sequences.
an example of a graph made with MSGA
@erikgarrison
7. Graph modification
mod - prune and normalize graphs, shorten nodes, and much more.
circularize - circularize paths in the graph.
ids - coordinate the ID spaces of multiple subgraphs.
Circular HPV variation graph
@erictdawson & Sarah Wagner (NCI)
8. Indexing variation graphs
The core, mutable VG data structures support graph manipulation,
queries, and alignment, but they are not scalable.
Thus...
XG - immutable bidirectional graph index
gPBWT - graph-generalized positional BWT (Adam Novak)
GCSA2 - path index queries for variation and de Bruijn graphs (Jouni
Sirén)
These structures enable mapping at whole-genome scale.
10. Mapping to a variation graph
vg can operate on arbitrary sequence graphs. All other graph-based
resequencing implementations (GRAL, BWBBLE, vBWT, GCSAv1)
require a DAG.
Local alignment to cyclic graphs is provided by unrolling:
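As a rough illustration of the idea of unrolling, the sketch below turns a cyclic edge set into an acyclic one by copying nodes into layers, so that each traversal of an edge moves one layer forward. This is a generic layered unrolling chosen for brevity, not necessarily the procedure vg uses; the function names `unroll` and `is_acyclic` are hypothetical.

```python
def unroll(edges, steps):
    """Layered unrolling: edge u->v becomes (u, i)->(v, i + 1) for each layer i."""
    unrolled = set()
    for i in range(steps):
        for u, v in edges:
            unrolled.add(((u, i), (v, i + 1)))
    return unrolled


def is_acyclic(edges):
    """Kahn's algorithm: True if all nodes can be placed in topological order."""
    nodes = {n for e in edges for n in e}
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    seen = 0
    while ready:
        n = ready.pop()
        seen += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return seen == len(nodes)
```

Any walk of up to `steps` edges in the original graph has a counterpart in the unrolled graph, which is why a DAG-only aligner can then be applied; the cost is that the node count grows with the number of layers.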
11. Exact-match guided alignment in vg
@erikgarrison
Obtain MEMs from GCSA2, cluster MEMs with xg's positional index,
then fully resolve alignment for non-matching portions using dynamic
programming.
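The seed-and-cluster step can be sketched on a plain string. This is a simplified stand-in under stated assumptions: it uses a linear reference and a brute-force search in place of GCSA2, it clusters by alignment diagonal in place of xg's positional index, and the names `find_mems` and `cluster_by_diagonal` and their thresholds are hypothetical.

```python
def find_mems(read, ref, min_len=3):
    """Return (read_start, ref_start, length) for maximal exact matches."""
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # Skip matches that could be extended to the left: not maximal.
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            # Extend to the right as far as the two strings agree.
            length = 0
            while (i + length < len(read) and j + length < len(ref)
                   and read[i + length] == ref[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


def cluster_by_diagonal(mems, max_shift=2):
    """Group MEMs whose (ref_start - read_start) offsets lie close together."""
    clusters = []
    for mem in sorted(mems, key=lambda m: m[1] - m[0]):
        diagonal = mem[1] - mem[0]
        # Compare against the first diagonal placed in the current cluster.
        if clusters and diagonal - clusters[-1]["diagonal"] <= max_shift:
            clusters[-1]["mems"].append(mem)
        else:
            clusters.append({"diagonal": diagonal, "mems": [mem]})
    return clusters
```

Each cluster marks a candidate placement of the read; only the gaps between its exact matches would then need dynamic programming.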
13. Mapping to a variation graph
1 billion reads / 32 vCPUs / 30 hours
14. Small variant calling in vg
call - bubble informed pileup-based caller
genotype - FreeBayes style genotyping using graph augmentation
and superbubble detection
16. Interchange with other programs
vectorize - export Alignments as Vowpal-Wabbit vectors for ML
view - GFA/JSON/DOT for many graph entities
1.0 1 ref_1A | ref 1 1 0 1 0 1 0 1
17. Other functions
locify (beta) - extract relevant info for external phasing.
deconstruct (beta) - extract an input VCF from the graph.
sim - simulate reads and exact alignments from the graph.
stats - print relevant graph properties.
surject - push graph alignments to BAM space.
sift/scrub (beta) - filter / select alignments by mapping properties.
translate - lift graph coordinates between graphs.
19. Structural variation
Loosely defined as changes to the genomic sequence longer than 50 bp.
1. Balanced events
   1. Inversions
   2. Translocations
2. Unbalanced events
   1. Insertions
   2. Deletions
   3. Duplications
3. "Complex" - not shown
   - Multiple events occurring in tandem, specific time series of events, etc.
20. Why are we talking about SVs?
1. Evidence for structural variation in the NCI Chernobyl
study.
• Matched tumor/normal samples from >400 individuals exposed to 131I (iodine-131)
post-Chernobyl who subsequently developed papillary thyroid carcinoma.
2. Variation graphs provide novel ways to locate and score
SV calls.
• Graphs are mutable - candidate variation can be inserted, mapped
against, and refined.
23. Evidence of SV in Chernobyl data
Lumpy calls: median (total)
Sample        DEL              INV             Other (excl. BND)
Tumor (12)    9074 (187,150)   3925 (53,643)   2185 (25,182)
Normal (11*)  5059 (60,508)    378 (9,494)     1234 (13,682)
Blood (12)    6634 (88,985)    98 (1,195)      1247 (14,181)
Median and total call numbers (as well as analysis by others)
indicate we might expect a high burden of deletions and
inversions in tumours from our dataset (relative to normals)
after normalization and QC.
*No A90N normal tissue sample;
A90G metastasis sequenced instead
25. Why use variation graphs and not [your favorite SV caller here]?
26. Read alignment is Bayesian
Our best alignment gets the highest posterior score.
[Figure labels: the reference genome; the read from the sequencer]
@erikgarrison
27. Detecting SVs with vg
We're stuck with our reads, but with variation graphs we can
sample from possible reference priors to maximize:
P(reference | reads)
We can also refine our breakpoints using this approach.
[Figure: signals relating graph, reads, and sample: mapq, soft clipping, mate
orientation, fragment length, path divergence, unmapped reads. Legend:
informative for SV type / uninformative]
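One way to read the quantity being maximized is through Bayes' rule. The notation here is an assumption, not taken from the slides: G stands for a candidate reference graph, R = {r_1, ..., r_n} for the reads, and the product form additionally assumes the reads are independent given G.

```latex
\[
P(G \mid R) \;=\; \frac{P(R \mid G)\,P(G)}{P(R)}
\;\propto\; P(G)\,\prod_{i=1}^{n} P(r_i \mid G)
\]
```

Since P(R) is the same for every candidate, comparing candidate graphs only requires the prior P(G) and how well each read aligns to that graph.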
33. SV realignment / calling in vg
1. Map our reads
2. Collect read signatures
- soft clipped reads
- discordant fragment length
3. Create candidate alleles based on
signatures and add them to graph
4. Remap our reads, score the new
alignments, and repeat if necessary
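Step 2 above (collecting read signatures) might look roughly like the following. This is a minimal sketch under several assumptions: it uses linear SAM-style CIGAR strings as a stand-in for graph alignments, and the `Read` record, the function names, and the thresholds `min_clip` and `n_sd` are hypothetical rather than anything vg defines.

```python
import re
from dataclasses import dataclass

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


@dataclass
class Read:
    name: str
    cigar: str
    fragment_length: int


def soft_clipped_bases(cigar):
    """Total number of soft-clipped (S) bases in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")


def collect_signatures(reads, mean_len, sd_len, min_clip=10, n_sd=3.0):
    """Map read name -> list of SV signatures seen on that read."""
    signatures = {}
    for read in reads:
        found = []
        if soft_clipped_bases(read.cigar) >= min_clip:
            found.append("soft_clip")
        # Flag fragments far from the expected library insert size.
        if abs(read.fragment_length - mean_len) > n_sd * sd_len:
            found.append("discordant_fragment_length")
        if found:
            signatures[read.name] = found
    return signatures
```

Reads flagged here would seed the candidate alleles of step 3; reads with no signature are left out of the result.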
35. Early results - small variants
We submitted the first whole human genome analysis using variation graph
reference methods as part of the PrecisionFDA resequencing competition.
(May 2016.)
We did not win... but we did get a star for:
We're now approaching 99% F-score with vg call.
Erik Garrison, A toolkit for practical pangenomics, ECCB 2016
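For readers unfamiliar with the metric, the sketch below computes an F-score as the harmonic mean of precision and recall (the usual F1 definition, assumed here to be the one meant). The function name and the counts in the usage note are made up for illustration and are not results from the talk.

```python
def f_score(true_pos, false_pos, false_neg):
    """F1: harmonic mean of precision and recall over variant calls."""
    if true_pos == 0:
        # No correct calls: define the score as zero and avoid dividing by zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 990 true positives, 10 false positives, and 10 false negatives, precision and recall are both 0.99, and so is the F-score.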
36. Progress - structural variants
Pipeline still under development:
• Deletions in time for the holidays.
• Inversions, insertions, and duplications to follow.
37. Opportunities for improvement
• Variant calling at all scales remains unsolved.
• Haplotype phasing on the graph unexplored.
• Currently no local assembly functionality.
   • Exploring fermi-lite, but by no means finished.
• Few downstream analysis options
   • Apps can easily operate on vg structures, JSON, GFA, etc.
• Additional user feedback / requests would be useful.
and many more...
38. Where do I see vg making a difference in my work?*
• Viral coinfection detection / classification
• Multiclonal tumors
• SV breakpoint refinement
• Tumor / normal specific graphs
@erictdawson
39. Acknowledgements
Many thanks to:
Richard Durbin
Stephen Chanock
Erik Garrison
Adam Novak
Benedict Paten
Glenn Hickey
Jordan Eizenga
Jerven Bolleman
Maciek Smuga
Mike Lin
... and many others!
Thank y'all for listening!
@erictdawson