1. Best practices for benchmarking variant calls
Justin Zook and the GA4GH Benchmarking Team
NIST Genome-Scale Measurements Group
Joint Initiative for Metrology in Biology (JIMB)
Genome in a Bottle Consortium
November 14, 2017
2. Take-home Messages
• Benchmarking variant calls is easy to do incorrectly
• The GA4GH Benchmarking Team has developed a set of public
tools for robust, standardized benchmarking of variant calls
• Benchmarking results should be interpreted critically
• Ongoing work on difficult variants and regions
3. Why are we doing this work?
• Technologies evolving rapidly
• Different sequencing and
bioinformatics methods give
different results
• Now have concordance in easy
regions, but not in difficult
regions
• Challenge:
– How do we benchmark variants in a
6 billion base-pair genome?
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
4. Genome in a Bottle Consortium
Authoritative Characterization of Human Genomes
Generic measurement process: Sample → gDNA isolation → Library Prep →
Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates →
Downstream Analysis
• gDNA reference materials to evaluate performance
• established consortium to develop reference materials, data, methods,
performance metrics
www.slideshare.net/genomeinabottle
5. Bringing Principles of Metrology
to the Genome
• Reference materials
– DNA in a tube you can buy from
NIST
• Extensive state-of-the-art
characterization
– arbitrated “gold standard” calls for
SNPs, small indels
• “Upgradable” as technology
develops
• PGP genomes suitable for derived commercial products
• Developing benchmarking tools
and software
– with GA4GH
• Samples being used to develop
and demonstrate new technology
6. Benchmarking the GIAB benchmarks
• Compare high-confidence calls to
other callsets and manually
inspect subset of differences
– vs. pedigree-based calls
– vs. common pipelines
– Trio analysis
• When benchmarking a new callset against ours, most putative FPs/FNs
should be real errors in the new callset, not errors in the benchmark
8. Evolution of high-confidence calls
Version  HC Regions  HC Calls  HC indels  Concordant with PG  NIST-only in beds  PG-only in beds  PG-only  Phased
v2.19    2.22 Gb     3153247   352937     3030703             87                 404              1018795  0.3%
v3.2.2   2.53 Gb     3512990   335594     3391783             57                 52               657715   3.9%
v3.3     2.57 Gb     3566076   358753     3441361             40                 60               608137   8.8%
v3.3.2   2.58 Gb     3691156   487841     3529641             47                 61               469202   99.6%
Manual curation of the discordant calls found 5-7 and 1-7 errors in NIST:
~2 FPs and ~2 FNs per million NIST variants in PG and NIST bed files
9. Global Alliance for Genomics and Health Benchmarking Task
Team
• Developed standardized
definitions for performance
metrics like TP, FP, and FN.
• Developing sophisticated
benchmarking tools
• Integrated into a single framework
with standardized inputs and
outputs
• Standardized bed files with
difficult genome contexts for
stratification
https://github.com/ga4gh/benchmarking-tools
Variant types can change when decomposing
or recomposing variants:
Complex variant:
chr1 201586350 CTCTCTCTCT CA
DEL + SNP:
chr1 201586350 CTCTCTCTC C
chr1 201586359 T A
Credit: Peter Krusche, Illumina
GA4GH Benchmarking Team
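Whether two such representations are equivalent can be checked by reconstructing the alternate haplotype each one implies. A minimal sketch of that idea in Python (the deletion record here is written with a 9-base REF spanning chr1:201586350-201586358 so that it does not overlap the SNP at 201586359; this is an illustration, not the GA4GH implementation):

```python
def apply_variants(ref_seq, ref_start, variants):
    """Build the alternate haplotype implied by sorted, non-overlapping
    (pos, ref, alt) records applied to a reference segment."""
    hap = []
    cursor = ref_start
    for pos, ref, alt in sorted(variants):
        assert pos >= cursor, "overlapping variant records"
        offset = pos - ref_start
        assert ref_seq[offset:offset + len(ref)] == ref, "REF mismatch"
        hap.append(ref_seq[cursor - ref_start:offset])  # untouched reference
        hap.append(alt)                                 # substituted allele
        cursor = pos + len(ref)
    hap.append(ref_seq[cursor - ref_start:])            # reference tail
    return "".join(hap)

REF = "CTCTCTCTCT"   # chr1:201586350-201586359
START = 201586350

# Complex representation: one record replacing the whole segment.
complex_hap = apply_variants(REF, START, [(START, "CTCTCTCTCT", "CA")])
# Decomposed representation: an 8 bp deletion plus a SNP.
decomposed_hap = apply_variants(REF, START, [(START, "CTCTCTCTC", "C"),
                                             (201586359, "T", "A")])
print(complex_hap, decomposed_hap)   # CA CA
```

Both representations yield the same haplotype, so a haplotype-aware comparison engine such as vcfeval treats them as concordant, while naive positional matching would count them as discordant.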
10. Why are definitions important?
Challenges
• Genotype comparisons don’t naturally
fall into 2 categories as required for
sensitivity, precision, and specificity
• Sometimes variants are partially called
and/or partially filtered
• Clustered variants can be counted
individually or as a single complex
event
• How should filtered variants or “no-
call” sites be treated?
Example cases
• Truth is a heterozygous SNP but vcf has
a homozygous SNP
– 1 FP, 1 FN, and 1 Genotype mismatch
• Truth is an indel but vcf has a SNP at
same position
– 1 FP, 1 FN, and 1 allele mismatch
• Truth is a deletion + SNP but vcf has
the deletion only
– 1 TP and 1 FN, or 1 FP and 1-2 FNs,
depending on representations and
comparison method
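The example cases above can be expressed as a small decision procedure over genotypes. This is a simplified sketch of the GA4GH/hap.py convention (genotypes as tuples of allele indices on a shared index, 0 = REF; the FP.GT and FP.AL labels mark genotype and allele mismatches, each counted as both an FP and an FN):

```python
def classify(truth_gt, query_gt):
    """Classify one site given truth and query genotypes.
    None means no call at the site.  Simplified illustration of the
    GA4GH decision scheme, not the reference implementation."""
    if truth_gt is None:
        return "FP"                          # query-only call
    if query_gt is None:
        return "FN"                          # missed truth call
    if sorted(truth_gt) == sorted(query_gt):
        return "TP"                          # exact genotype match
    if set(truth_gt) & (set(query_gt) - {0}):
        return "FP.GT"                       # right allele, wrong genotype
    return "FP.AL"                           # allele mismatch

# Truth het SNP (0/1) vs. query hom-alt SNP (1/1): genotype mismatch.
print(classify((0, 1), (1, 1)))              # FP.GT
# Truth indel (allele 1) vs. query SNP (allele 2) at the same position.
print(classify((0, 1), (0, 2)))              # FP.AL
```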
12. Comparison methods affect performance metrics
• Some callers are affected by the comparison method more than
others
–Biggest effect from clustering nearby variants
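The clustering effect can be illustrated with toy numbers (all hypothetical, not from the slides): suppose a truth set has 60 simple sites and 40 complex sites, each complex site decomposing into a DEL plus a SNP, and the query misses 5 of the complex sites. The same calls give different recall depending on whether clustered variants are counted as single events or individually:

```python
def recall(tp, fn):
    """Recall (sensitivity) = TP / (TP + FN)."""
    return tp / (tp + fn)

# Clustered counting: each complex site is one event.
tp_event, fn_event = 95, 5
# Decomposed counting: each complex site contributes a DEL and a SNP.
tp_split = 60 + 35 * 2          # 60 simple TPs + 35 complex sites * 2
fn_split = 5 * 2                # each missed complex site = 2 FNs
print(round(recall(tp_event, fn_event), 4))   # 0.95
print(round(recall(tp_split, fn_split), 4))   # 0.9286
```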
13. GA4GH Reference Implementation
Truth VCF + Query VCF
→ Comparison Engine (vcfeval / vgraph / xcmp / bcftools / ...)
→ intermediate VCF (VCF-I)
→ Quantification (quantify / hap.py), using stratification BED files and
confident call regions
→ annotated VCF (VCF-R), counts / ROCs
→ HTML report, e.g. for precisionFDA
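The quantification step reduces the comparison engine's matched records to counts and summary metrics. The core arithmetic is just the following (counts here are illustrative, not taken from the slides):

```python
def summarize(tp, fp, fn):
    """GA4GH summary metrics from TP/FP/FN counts:
    recall = TP/(TP+FN), precision = TP/(TP+FP), F1 = harmonic mean."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

recall, precision, f1 = summarize(tp=99000, fp=120, fn=150)
print(f"recall={recall:.4f} precision={precision:.4f} f1={f1:.4f}")
# recall=0.9985 precision=0.9988 f1=0.9986
```

hap.py additionally emits these metrics per stratification region and per variant type, and as ROC points across quality thresholds.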
16. FN rates high in some tandem repeats
[Figure: FN rate vs. average coverage (0.3x, 1x, 3x, 10x, 30x), stratified by
repeat unit size (2 bp, 3 bp, 4 bp) and total repeat length (11-50 bp and
51-200 bp)]
17. Benchmarking stats can be difficult to interpret
Example: FN SNPs in coding regions
RefSeq Coding Regions
• Studies often focus on variants in
coding regions
• We look at FN SNP rates for bwa-GATK
using the decoy
SNP benchmarking stats vs. PG and 3.3.2
• 97.98% sensitivity vs. PG
– FNs predominantly in low-MQ and/or segmental duplication regions
– ~80% of FNs supported by long or linked reads
• 99.96% sensitivity vs. NISTv3.3.2
– 62x lower FN rate than vs. PG
• As always, true sensitivity is unknown
18. Benchmarking stats can be difficult to interpret
True accuracy is hard to estimate, especially in difficult regions
19. Benchmarking against each GIAB genome
Genome  Type  Subset  100%−recall (%)  100%−precision (%)  Recall  Precision  Fraction of calls outside high-conf bed
HG001 SNP all 0.0277 0.1274 0.9997 0.9987 0.1653
HG002 SNP all 0.0664 0.1342 0.9993 0.9987 0.1910
HG003 SNP all 0.0625 0.1489 0.9994 0.9985 0.1967
HG004 SNP all 0.0633 0.1592 0.9994 0.9984 0.1975
HG005 SNP all 0.1175 0.0870 0.9988 0.9991 0.1834
HG001 SNP notinalldifficultregions 0.0096 0.0783 0.9999 0.9992 0.0491
HG002 SNP notinalldifficultregions 0.0102 0.0576 0.9999 0.9994 0.0864
HG003 SNP notinalldifficultregions 0.0128 0.0819 0.9999 0.9992 0.0864
HG004 SNP notinalldifficultregions 0.0102 0.0860 0.9999 0.9991 0.0854
HG005 SNP notinalldifficultregions 0.0931 0.0541 0.9991 0.9995 0.0664
HG001 INDEL all 0.8354 0.7458 0.9916 0.9925 0.4485
HG002 INDEL all 0.8271 0.7016 0.9917 0.9930 0.4547
HG003 INDEL all 0.7546 0.6523 0.9925 0.9935 0.4632
HG004 INDEL all 0.7345 0.6390 0.9927 0.9936 0.4592
HG005 INDEL all 0.9840 0.7418 0.9902 0.9926 0.4850
HG001 INDEL notinalldifficultregions 0.0551 0.1475 0.9994 0.9985 0.1927
HG002 INDEL notinalldifficultregions 0.0497 0.0893 0.9995 0.9991 0.2208
HG003 INDEL notinalldifficultregions 0.0508 0.1627 0.9995 0.9984 0.2229
HG004 INDEL notinalldifficultregions 0.0496 0.1307 0.9995 0.9987 0.2190
HG005 INDEL notinalldifficultregions 0.1182 0.1535 0.9988 0.9985 0.2049
20. Approaches to Benchmarking Variant Calling
• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
21. Challenges in Benchmarking Variant Calling
• It is difficult to do robust benchmarking of tests designed to
detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file,
but…
• Benchmark calls/regions tend to be biased towards easier
variants and regions
– Some clinical tests are enriched for difficult sites
• Can you predict your performance for clinical variants of
interest based on sequencing reference samples?
22. Best Practices for Benchmarking
Benchmark sets: Use benchmark sets with both high-confidence variant calls and high-confidence regions, so that both false negatives and false positives can be assessed.
Stringency of variant comparison: Determine whether it is important that the genotype match exactly, that only the allele match, or that the call merely be near the true variant.
Variant comparison tools: Use sophisticated variant comparison engines such as vcfeval, xcmp, or varmatch that can determine whether different representations of the same variant are consistent with the benchmark call. Subsetting by high-confidence regions and, if desired, targeted regions should only be done after comparison, to avoid problems comparing variants with differing representations.
Manual curation: Manually curate alignments, ideally from multiple data types, around at least a subset of putative false positive and false negative calls, to ensure they are truly errors in the user's callset and to understand the cause(s) of errors. Report any potential errors found in the benchmark set back to its developers (e.g., using https://goo.gl/forms/ECbjHY7nhz0hrCR52 for GIAB).
Interpretation of metrics: Interpret all performance metrics with respect to the limitations of the variants and regions in the benchmark set. Performance is likely to be lower for more difficult variant types and regions not fully represented in the benchmark set, such as repetitive or difficult-to-map regions. When comparing methods, method 1 may perform better in the high-confidence regions while method 2 performs better for more difficult variants outside them.
Stratification: Overall performance metrics can be useful, but for many applications it is important to assess performance for particular variant types and genome contexts. Performance often varies significantly across these, and stratification lets users understand that variation and see whether variant types or genome contexts of interest are insufficiently represented.
Confidence intervals: Calculate confidence intervals for performance metrics such as precision and recall. This is particularly critical for the smaller numbers of variants found when benchmarking in targeted regions and/or less common stratified variant types and regions.
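For the confidence-interval recommendation, a binomial interval on a proportion suffices for precision or recall. A sketch using the Wilson score interval (this particular interval is our choice for illustration, not prescribed by the slides):

```python
import math

def wilson_interval(successes, total, z=1.96):
    """95% Wilson score interval for a binomial proportion, e.g.
    precision = TP / (TP + FP).  More reliable than the normal
    approximation when error counts are small."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total))
    return (center - half, center + half)

# Precision from 500 TPs and 2 FPs in a small targeted panel:
lo, hi = wilson_interval(500, 502)
print(f"precision = {500/502:.4f}, 95% CI ({lo:.4f}, {hi:.4f})")
```

With only two FPs, the interval is wide relative to the point estimate, which is exactly why intervals matter for small targeted regions.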
23. Ongoing and Future Work
• Characterizing difficult variants and regions
– Large indels and structural variants
– Tandem repeats and homopolymers
– Difficult to map regions
– Complex variants
• New germline samples
– Additional ancestries
• Tumor/normal cell lines
– Developing IRB protocol for broadly-consented samples
24. Acknowledgements
• NIST/JIMB
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
– Lesley Chapman
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
25. For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group
emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://www.nature.com/articles/sdata201625
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
Public workshops
– Next workshop Jan 25-26, 2018 at Stanford University, CA, USA
NIST/JIMB postdoc opportunities available!
Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov