3. About me
Worked at the UC Berkeley AMPlab for about a year
Currently the primary SMaSH developer
Starting a CS PhD at Berkeley in Programming Languages this fall
9. Codebase
For benchmarking purposes, we compare a predicted callset against a
ground truth callset.
Comparing two predicted callsets works exactly the same way.
11. Evaluation
SNPs and indels are strictly evaluated.
Structural variants are evaluated on:
Same type (insertion/deletion/other)
Length same as true variant within specified tolerance
Position same as true variant within specified tolerance
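The three structural variant criteria above can be sketched as a single predicate. This is a minimal illustration, not SMaSH's actual code: the `StructuralVariant` record, the `sv_matches` function, and the default tolerance values are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralVariant:
    # Hypothetical record type, for illustration only.
    chrom: str
    pos: int
    sv_type: str  # "INS", "DEL", or "OTHER"
    length: int


def sv_matches(pred, true, pos_tolerance=100, length_tolerance=100):
    """Return True if pred counts as a match for true.

    The tolerance defaults are placeholders; the slides only say the
    tolerances are "specified", not what their values are.
    """
    return (
        pred.chrom == true.chrom
        and pred.sv_type == true.sv_type
        and abs(pred.pos - true.pos) <= pos_tolerance
        and abs(pred.length - true.length) <= length_tolerance
    )
```

With these placeholder tolerances, a predicted deletion of length 480 at position 10,050 would match a true deletion of length 500 at position 10,000, while an insertion at the same position would not.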
14. The VCF format is ambiguous!
SMaSH addresses this problem with two strategies:
Normalization
Rescue
Guiding principle: metrics should never be worse after
normalization/rescue than they were without them.
17. First, we remove the longest common proper suffix from the ref and alt alleles.
18. Then, we "slide" the variants by adding a base from the reference to the
head and removing a base from the tail, until the last bases on both
alleles are no longer the same.
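The two normalization steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the SMaSH implementation: the function name `normalize` is hypothetical, the reference is a plain string, and `pos` is a 0-based index into it (VCF itself uses 1-based positions).

```python
def normalize(reference, pos, ref, alt):
    """Trim the shared suffix, then slide the variant left.

    reference: reference sequence as a string (assumption for this sketch)
    pos:       0-based index of the first base of the ref allele
    Returns the normalized (pos, ref, alt).
    """
    # Step 1: drop the shared suffix, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Step 2: slide left while the last bases still agree, prepending the
    # preceding reference base and dropping the final base of each allele.
    while pos > 0 and ref[-1] == alt[-1]:
        prev = reference[pos - 1]
        ref = (prev + ref)[:-1]
        alt = (prev + alt)[:-1]
        pos -= 1

    return pos, ref, alt
```

For the reference `TGCACACAG`, the deletion written as `ACA` to `A` at index 5 and the same deletion written as `GCACA` to `GCA` at index 1 both normalize to `(1, "GCA", "G")`, so the two representations become directly comparable.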
19. Rescue
The same underlying haplotype can be represented by different sets of
variants.
True callset
Predicted callset
20. Rescue Algorithm
For every false negative, we attempt rescue:
Build up a window around the variant position for the true and
predicted callsets
For all sets of non-overlapping variants, expand the underlying
haplotypes for the variants within those windows.
If the haplotypes match, mark all false negatives/false positives as
true positives.
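The haplotype comparison at the core of rescue can be sketched as below. This is a minimal illustration, not the SMaSH implementation: the names `expand_haplotype` and `can_rescue` are hypothetical, variants are plain `(pos, ref, alt)` tuples, and the sketch compares only one set of non-overlapping variants per callset. Choosing the window and enumerating candidate variant sets are left out.

```python
def expand_haplotype(window_seq, window_start, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference window.

    window_seq:   reference bases covering the window
    window_start: reference coordinate of window_seq[0]
    """
    pieces = []
    cursor = window_start  # next reference coordinate not yet emitted
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("variants overlap")
        offset = pos - window_start
        if window_seq[offset:offset + len(ref)] != ref:
            raise ValueError("ref allele does not match the reference")
        pieces.append(window_seq[cursor - window_start:offset])
        pieces.append(alt)
        cursor = pos + len(ref)
    pieces.append(window_seq[cursor - window_start:])
    return "".join(pieces)


def can_rescue(window_seq, window_start, true_variants, predicted_variants):
    """True if both variant sets yield the same haplotype in the window."""
    return (
        expand_haplotype(window_seq, window_start, true_variants)
        == expand_haplotype(window_seq, window_start, predicted_variants)
    )
```

For example, a true callset holding one two-base substitution and a predicted callset holding the two equivalent SNPs expand to the same sequence, so the false negative and both false positives would be eligible to be marked as true positives.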
22. Outputs
Statistics, including counts for all categories, in plain text, TSV and
JSON formats
Calculations for precision and recall, including error bars
VCF containing variants from both callsets, annotated with the callset
they came from and their categorization (TP/FP/FN/rescued)
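Precision and recall follow directly from the category counts. The slides do not say how SMaSH derives its error bars, so the sketch below swaps in a Wilson score interval purely as a stand-in; the function names and the returned dictionary layout are likewise assumptions made for the example.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (stand-in error bar).

    z=1.96 corresponds to roughly 95% confidence.
    """
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def precision_recall(tp, fp, fn):
    """Point estimates and intervals from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "precision_interval": wilson_interval(tp, tp + fp),
        "recall": recall,
        "recall_interval": wilson_interval(tp, tp + fn),
    }
```

With 90 true positives, 10 false positives, and 30 false negatives, this gives a precision of 0.9 and a recall of 0.75, each with an interval that brackets the point estimate.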
26. Feature Roadmap
New variant types: complex variants, compound heterozygous variants,
etc.
Phasing evaluation
Better handling of known false positives
30. Datasets
The SMaSH paper proposed eight datasets, including synthetic, sampled
human, and mouse.
Other data to use as ground truth?
NIST pedigree calls for NA12878
The Illumina Platinum Genomes
Others?