Using long reads to expand Genome in a Bottle small variant benchmark

Using long and linked reads to generate a new Genome in a Bottle small variant benchmark
Justin Wagner, Andrew Carroll, Ian T. Fiddes, Aaron M. Wenger, William J.
Rowell, Nathan Olson, Lindsey Harris, Jenny McDaniel, Xin Zhou, Sergey
Aganezov, Melanie Kirsche, Bohan Ni, Samantha Zarate, Byunggil Yoo, Neil
Miller, C. Xiao, Marc Salit, Justin Zook, Genome in a Bottle Consortium
GRC/GIAB Workshop ASHG 2019

Overview
• v3.3.2 benchmark variants and regions cover 87.84% of assembled
bases in chromosomes 1-22 in GRCh37 for the sample HG002
• Short read variant callers perform poorly in genomic locations with
high homology such as segmental duplications and low-complexity
repeat-rich regions
• Now utilizing PacBio CCS and 10X Genomics data to expand the GIAB
benchmark regions and reduce errors in current regions
• Long and linked reads add variants to the benchmark, mostly in
regions difficult to map with short reads
• GRCh37: 276,840 SNPs and 53,482 INDELs

How the benchmark is generated

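When do we trust variants and regions from each method
[Figure: schematic for one variant calling method X. Variants are labeled PASS or filtered outliers. Callable regions exclude low/high coverage or low MQ (or low GQ for gVCF) and difficult regions/SVs (a TR is shown). Example sites (1), (2), (3) carry genotypes 1/1 and 0/1.]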

Arbitrating between variant calls in different methods
[Figure: PASS variants and callable regions from input methods #1 and #2 are combined into benchmark calls and benchmark regions. Four example sites are shown: (1) concordant; (2) discordant, unresolved; (3) discordant, arbitrated; (4) concordant, not callable.]
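The four cases in the arbitration figure can be read as a small decision rule. The sketch below is one illustrative interpretation, not the GIAB integration code: the `MethodCall` type, the `arbitrate` function, and the exact rules (a call is trusted only where it lies in that method's callable regions) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MethodCall:
    """Genotype reported by one method at one site (hypothetical type)."""
    genotype: str             # e.g. "0/1" or "1/1"
    in_callable_region: bool  # True if the site is in this method's callable regions


def arbitrate(a: MethodCall, b: MethodCall) -> Tuple[str, Optional[str]]:
    """Return (case label, benchmark genotype or None) for two input methods.

    Assumed rules, for illustration only:
      - same genotype, at least one method callable -> concordant, keep the call
      - same genotype, neither method callable      -> concordant, not callable
      - different genotypes, exactly one callable   -> arbitrated toward that method
      - anything else                               -> discordant, unresolved
    """
    if a.genotype == b.genotype:
        if a.in_callable_region or b.in_callable_region:
            return "concordant", a.genotype
        return "concordant, not callable", None
    if a.in_callable_region and not b.in_callable_region:
        return "discordant, arbitrated", a.genotype
    if b.in_callable_region and not a.in_callable_region:
        return "discordant, arbitrated", b.genotype
    return "discordant, unresolved", None


if __name__ == "__main__":
    print(arbitrate(MethodCall("0/1", True), MethodCall("1/1", False)))
```

Sites that come back with `None` would be left out of the benchmark regions under these assumed rules.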

Sequencing data used in integration for HG002
Platform | Characteristics | Alignment; Variant Calling
Illumina | 150x150bp; ~300x coverage | Novoalign; GATK v3.5
CG | 26x26bp; ~100x coverage | Complete Genomics Pipeline
Illumina | 150x150bp; ~300x coverage | Novoalign; Freebayes
Illumina | 250x250bp; ~45x coverage | Novoalign; GATK v3.5
Illumina | 250x250bp; ~45x coverage | Novoalign; Freebayes
Illumina | 6Kbp mate pair; ~13x coverage | bwa_mem; GATK v3.5
Illumina | 6Kbp mate pair; ~13x coverage | bwa_mem; Freebayes
Ion | Exome; 1000x coverage | Torrent Suite v4.2; Torrent Variant Caller v4.4
SOLiD | 75bp; ~60x coverage | LifeScope v2.5.1; GATK v3.5
PacBio CCS | Sequel II ~11kb reads; ~32x coverage | minimap2; GATK4
PacBio CCS | Sequel II ~11kb reads; ~32x coverage | minimap2; DeepVariant v0.8
10x Genomics | Linked reads; ~84x coverage | LongRanger Pipeline

Long and linked reads cover more variants and regions
[Figure: the same schematic for variant calling method X, showing PASS variants, filtered outliers, callable regions, and the excluded low/high coverage, low MQ (or low GQ for gVCF), and difficult regions/SVs, with example sites (1), (2), (3).]
10x Genomics and PacBio CCS data add new variants (1), regions with good coverage of high MQ reads (2), and access to difficult regions (3)

Difficult Regions Excluded from all Methods
Difficult Region Description | Bases Covered in GRCh37 | Bases Covered in GRCh38
v0.6 SV GIAB Benchmark | 32,596,754 | 32,872,907
Potential copy number variation | 51,713,344 | 62,666,746
Tandem Repeats > 10kb | 5,731,885 | 71,942,255
Highly similar and high depth segmental duplications | 1,232,701 | 2,094,143
Regions that are collapsed and expanded from GRCh37/38 Primary Assembly Alignments | 17,979,597 | N/A
Modeled centromere and heterochromatin | N/A | 62,304,573

Difficult Regions Excluded by Method
• Tandem Repeats < 51bp except GATK from Illumina PCR-free, Complete
Genomics, and CCS DeepVariant
• Tandem Repeats > 51bp and < 200bp except GATK from Illumina PCR-
Free and CCS DeepVariant
• Tandem Repeats > 200bp except CCS DeepVariant
• Homopolymers > 6bp except GATK from Illumina PCR-free, Complete
Genomics, Ion Exome, PacBio CCS
• Imperfect homopolymer > 10bp except GATK from Illumina PCR-Free
• Difficult to map regions for short reads except 10x and CCS
• LINE:L1Hs > 500bp except Illumina MatePair, 10x, and CCS
• Segmental duplications except 10x and CCS
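The per-method exclusions listed above amount to a lookup from region category to the methods that remain usable there. The sketch below encodes that lookup; the category keys, the method labels, and the reading of "CCS" / "PacBio CCS" as covering both CCS callsets (GATK4 and DeepVariant) are assumptions for illustration, not names from the GIAB pipeline.

```python
# Assumption: "CCS" without a caller named refers to both CCS callsets.
CCS_ALL = frozenset({"CCS GATK4", "CCS DeepVariant"})
ILLUMINA_PCR_FREE_GATK = "Illumina PCR-free GATK"

# For each difficult-region category, the methods that are NOT excluded there.
# Keys and labels are illustrative strings, not GIAB file or callset names.
EXEMPT_METHODS = {
    "tandem_repeat_lt_51bp": {ILLUMINA_PCR_FREE_GATK, "Complete Genomics", "CCS DeepVariant"},
    "tandem_repeat_51_to_200bp": {ILLUMINA_PCR_FREE_GATK, "CCS DeepVariant"},
    "tandem_repeat_gt_200bp": {"CCS DeepVariant"},
    "homopolymer_gt_6bp": {ILLUMINA_PCR_FREE_GATK, "Complete Genomics", "Ion Exome"} | CCS_ALL,
    "imperfect_homopolymer_gt_10bp": {ILLUMINA_PCR_FREE_GATK},
    "difficult_to_map_short_reads": {"10x"} | CCS_ALL,
    "line_l1hs_gt_500bp": {"Illumina MatePair", "10x"} | CCS_ALL,
    "segmental_duplication": {"10x"} | CCS_ALL,
}


def is_excluded(category: str, method: str) -> bool:
    """True if `method` should not contribute calls in regions of `category`."""
    return method not in EXEMPT_METHODS[category]


if __name__ == "__main__":
    for category in EXEMPT_METHODS:
        print(category, "excluded for Illumina PCR-free GATK:",
              is_excluded(category, ILLUMINA_PCR_FREE_GATK))
```

Keeping the exemptions as data makes it easy to see that only the long and linked read callsets contribute in segmental duplications and regions that are difficult to map with short reads.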

v4 draft benchmark includes variants found
with haplotype-resolved assembly of MHC
• Worked with a team from the March 2019 NCBI Pangenome
Hackathon to generate haplotype-resolved assembly of MHC region
(chr6:28,477,797-33,448,354 in GRCh37)
• Use assembly to call small variants
• Small variants from assembly are integrated with mapping-based calls
in the MHC region for v4 draft benchmark
• v4 draft benchmark includes 23,229 variants in the MHC region
• Covers most HLA genes and CYP21A2/TNXA/TNXB

v4 draft benchmark includes more bases, variants, and segmental duplications
Metric | v4 draft GRCh37 | v4 draft GRCh38
Base pairs | 2,504,027,936 | 2,509,269,277
Reference covered | 93.2% | 91.03%
SNPs | 3,323,773 | 3,314,941
Indels | 519,152 | 519,494
Base pairs in Segmental Duplications | 64,300,499 | 73,819,342
[Figure: chart of percent of reference covered, axis from 80.00% to 95.00%.]

Some variants and segmental duplications only covered in v3.3.2 or v4 draft
[Figure: comparison of v3.3.2 and v4 draft for GRCh37 and GRCh38, with panels for SNPs, INDELs, and segmental duplications, each split into "Only in v3.3.2" and "Only in v4 draft".]
SNP and INDEL panel values, in extraction order: 343,358; 69,495; 77,324; 23,828; 376,653; 91,837; 91,719; 48,753
Segmental duplication panel values, in extraction order: 25,445; 63,949,151; 1,928,353; 70,187,985

v4 draft enables benchmarking in regions
difficult for short reads
Comparison of Illumina RTG VCF against benchmark sets
• SNP FNs increase by a factor of more than 3, mostly due to new
benchmark variants in difficult to map regions and segmental
duplications
• False negatives: variants present in the truth set, but missed in the query
Subset | v3.3.2 FNs | v4 draft FNs
All SNPs | 8,594 | 30,229
Low mappability | 6,708 | 25,295
Segmental duplications | 1,429 | 14,008
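The "factor of more than 3" can be checked directly from the false negative counts in the table. This is a small arithmetic sketch; the dictionary and function names are illustrative, and the counts are copied from the slide.

```python
# False negative counts for the Illumina RTG callset: (v3.3.2, v4 draft).
FN_COUNTS = {
    "All SNPs": (8_594, 30_229),
    "Low mappability": (6_708, 25_295),
    "Segmental duplications": (1_429, 14_008),
}


def fold_change(v332_fns: int, v4_fns: int) -> float:
    """Ratio of v4 draft FNs to v3.3.2 FNs for one subset."""
    return v4_fns / v332_fns


if __name__ == "__main__":
    for subset, (v332, v4) in FN_COUNTS.items():
        print(f"{subset}: {fold_change(v332, v4):.2f}x more FNs against v4 draft")
```

The ratio is roughly 3.5 for all SNPs and close to 10 for segmental duplications, consistent with most of the new false negatives falling in regions that short reads map poorly.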

v4 draft benchmark contains more medically-relevant variants
• v4 draft covers more of the MHC region
• Outside of the MHC updates, the 5 genes with the largest increase in variants from v3.3.2 to v4 draft benchmark: TSPEAR (31), LAMA5 (28), FCGBP (18), TPSAB1 (15), HSPG2 (13)
• Among ACMG59 genes, PMS2 has 2 more variants and RET, SCN5A, and TNNI3 each have 1 more variant covered in v4 draft benchmark than in v3.3.2
Variants in Medical Exome (genes from OMIM, HGMD, ClinVar, UniProt)
Benchmark Regions v3.3.2 | 8,209
Benchmark Regions v4 draft | 9,527

Sanger sequencing confirms medically-relevant variants
• Performed long range PCR before sequencing
• Confirmed 12 variants in CYP21A2, which is a medically-relevant gene in the MHC region
• Confirmed 6 variants in PMS2
• Confirmed 15 variants in 5 other genes

Evaluation by GIAB collaborators
Compared benchmark to callsets from a variety of technologies and variant calling methods including:
• Illumina PCR-Free and Dragen
• PacBio CCS and GATK4
• PacBio CCS and DeepVariant
• PacBio CCS and Clair (next generation of Clairvoyante)
• ONT PromethION and Clair
Preliminary results suggest that at a majority of FP and FN sites the benchmark is correct and the error is in the tested callset
More volunteers welcomed

Manual curation by callset developers
Process
• Compare callset to benchmark using hap.py and/or vcfeval
• Randomly select 5 FP SNPs, 5 FN SNPs, 5 FP indels and 5 FN indels, each from inside and outside the v3.3.2 benchmark bed, in GRCh37 and GRCh38 (5*4*2*2=80 total)
• Use IGV with PCR-free Illumina, PacBio CCS, 10x, and ONT + difficult bed files
Questions to ask
• Are both alleles correct in the benchmark?
  • Yes/No/Unsure
• Are both alleles correct in the callset being tested?
  • Yes/No/Unsure
• If the benchmark is wrong or questionable, how did you make this determination?
• Instructions: Be critical of the benchmark, and select unsure if the evidence does not strongly support the benchmark being correct
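The site selection in the process above is a stratified random sample: 5 sites from each combination of reference, error type, variant type, and location relative to the v3.3.2 bed. The sketch below shows one way to draw such a sample; the function name, the stratum tuple layout, and the placeholder site identifiers are assumptions for illustration, and real input would come from the hap.py or vcfeval comparison output.

```python
import random
from itertools import product
from typing import Dict, List, Tuple

# (reference, error type, variant type, location relative to the v3.3.2 bed)
Stratum = Tuple[str, str, str, str]


def sample_for_curation(
    sites_by_stratum: Dict[Stratum, List[str]],
    n_per_stratum: int = 5,
    seed: int = 0,
) -> Dict[Stratum, List[str]]:
    """Pick up to n_per_stratum sites at random from every stratum."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {
        stratum: rng.sample(sites, min(n_per_stratum, len(sites)))
        for stratum, sites in sites_by_stratum.items()
    }


# 2 references x 2 error types x 2 variant types x 2 locations = 16 strata.
strata = list(product(
    ("GRCh37", "GRCh38"),
    ("FP", "FN"),
    ("SNP", "indel"),
    ("inside v3.3.2 bed", "outside v3.3.2 bed"),
))

# Placeholder site identifiers standing in for real FP/FN records.
demo_sites = {s: [f"site_{i}" for i in range(50)] for s in strata}
selected = sample_for_curation(demo_sites)
total_selected = sum(len(sites) for sites in selected.values())

if __name__ == "__main__":
    print(len(strata), "strata,", total_selected, "sites selected")
```

With 5 sites in each of the 16 strata, the total matches the 80 sites per callset given on the slide.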

Process for independent evaluations
• Callset developer curates putative errors
  • Benchmark is correct: no further curation
  • Benchmark is wrong or questionable: NIST curator reviews
    • NIST curator disagrees: discuss with callset developer
    • NIST curator agrees: classify source of potential error in benchmark

Initial evaluation suggests that at a majority of FP and FN sites the benchmark is correct and the error is in the tested callset
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites
CCS with GATK GRCh37 FP | 19 | 1 | 0 | 19 | 20
CCS with GATK GRCh37 FN | 15 | 3 | 2 | 18 | 20
ONT with Clair GRCh37 FP | 33 | 1 | 0 | 34 | 34
ONT with Clair GRCh37 FN | 27 | 3 | 0 | 30 | 30
CCS with Clair GRCh37 FP | 7 | 13 | 0 | 6 | 20
CCS with Clair GRCh37 FN | 19 | 1 | 0 | 19 | 20
Illumina with Dragen GRCh37 FP | 14 | 6 | 0 | 11 | 20
Illumina with Dragen GRCh37 FN | 17 | 3 | 0 | 17 | 20
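The "majority" statement can be checked by tallying the GRCh37 rows in the table above. This is an arithmetic sketch only; the row tuples are copied from the slide in the order (benchmark correct, benchmark unsure, benchmark not correct, callset not correct, total sites), and the variable names are illustrative.

```python
# (benchmark correct, benchmark unsure, benchmark not correct,
#  comparison callset not correct, total sites), copied from the slide.
CURATION_ROWS = {
    "CCS with GATK GRCh37 FP": (19, 1, 0, 19, 20),
    "CCS with GATK GRCh37 FN": (15, 3, 2, 18, 20),
    "ONT with Clair GRCh37 FP": (33, 1, 0, 34, 34),
    "ONT with Clair GRCh37 FN": (27, 3, 0, 30, 30),
    "CCS with Clair GRCh37 FP": (7, 13, 0, 6, 20),
    "CCS with Clair GRCh37 FN": (19, 1, 0, 19, 20),
    "Illumina with Dragen GRCh37 FP": (14, 6, 0, 11, 20),
    "Illumina with Dragen GRCh37 FN": (17, 3, 0, 17, 20),
}

# Sanity check: the three benchmark verdicts should add up to the row total.
for label, (correct, unsure, wrong, _callset_wrong, total) in CURATION_ROWS.items():
    assert correct + unsure + wrong == total, label

total_correct = sum(row[0] for row in CURATION_ROWS.values())
total_wrong = sum(row[2] for row in CURATION_ROWS.values())
total_sites = sum(row[4] for row in CURATION_ROWS.values())
fraction_correct = total_correct / total_sites

if __name__ == "__main__":
    print(f"benchmark judged correct at {total_correct} of {total_sites} sites "
          f"({fraction_correct:.0%}); judged not correct at {total_wrong}")
```

Across these eight rows the benchmark was judged correct at 151 of 184 curated sites and not correct at 2, with the remainder marked unsure.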

Evaluation FPs – Inversions
LINEs

Evaluation FPs – Complex SVs

Evaluation FPs – Near low coverage

Potential refinements identified for v4.1
• Exclude VDJ
• Exclude Inversions
• Improve CNV coverage
  • Use ONT for excessive coverage
  • Explore smoothing on excessive coverage beds
  • Use new diploid assemblies to identify CNVs
• MHC
  • Exclude CNVs in the MHC, partial repeats in MHC, small regions that are questionable in the DRB genes
• Benchmark regions density
  • Regions with dense variation and many gaps in bed
  • Dense variants near SVs
• Segmental duplications
  • Small region of duplication covered by benchmark
  • Containing an SV

Conclusions
• Long and linked reads add variants to the benchmark, mostly in
regions difficult to map with short reads
• v4 draft benchmark is available for GRCh37 and GRCh38
  • GRCh37 Percent Chromosomes 1-22 Covered: 93.2%
  • GRCh38 Percent Chromosomes 1-22 Covered: 91.03%
• Initial evaluation suggests that at a majority of FP and FN sites the benchmark is correct and the error is in the tested callset
• More volunteers welcomed
• Identified refinements for v4.1

On-going and Future Work
• Refine use of genome stratifications
• Add variant calls from raw PacBio and Oxford Nanopore
• Improve benchmark for larger indels, homopolymers, and tandem repeats
• Improve normalization of complex variants
• Generate benchmark variants from diploid assemblies
• Machine learning
  • Outlier detection, active learning
• Generate v4 draft for other GIAB genomes

Acknowledgements
• Andrew Carroll
• Ian T. Fiddes
• Aaron M. Wenger
• William J. Rowell
• Nathan Olson
• Lindsey Harris
• Jenny McDaniel
• Chunlin Xiao
• Marc Salit
• Justin Zook
• Genome in a Bottle Consortium
Draft Benchmark Evaluators
• Xin Zhou
• Sergey Aganezov
• Melanie Kirsche
• Bohan Ni
• Samantha Zarate
• Byunggil Yoo
• Neil Miller


Integration Pipeline Process
(1) Find sensitive variant calls and callable regions for each dataset, excluding difficult regions/SVs that are problematic for each type of data and variant caller
(2) Find “consensus” calls with support from 2+ technologies (and no other technologies disagree) using callable regions
(3) Use “consensus” calls to train simple one-class model for each dataset and find “outliers” that are less trustworthy for each dataset
(4) Find benchmark calls by using callable regions and “outliers” to arbitrate between datasets when they disagree
(5) Find benchmark regions by taking union of callable regions and subtracting uncertain variants
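The "consensus" step of the pipeline (support from 2+ technologies, with no other technology disagreeing, restricted to callable regions) can be sketched as a small function. This is a simplified illustration under stated assumptions, not the GIAB integration code: the `DatasetCall` type and `consensus_genotype` function are hypothetical, each dataset is reduced to a single genotype string per site, and the one-class outlier model and later arbitration steps are left out.

```python
from typing import Iterable, NamedTuple, Optional


class DatasetCall(NamedTuple):
    """One dataset's call at a site (hypothetical record for this sketch)."""
    technology: str           # e.g. "Illumina", "PacBio CCS", "10x Genomics"
    genotype: str             # e.g. "0/1", "1/1"
    in_callable_region: bool  # site lies in this dataset's callable regions


def consensus_genotype(
    calls: Iterable[DatasetCall],
    min_technologies: int = 2,
) -> Optional[str]:
    """Return the consensus genotype at a site, or None if there is none.

    Only datasets that are callable at the site are considered. A consensus
    requires that all of them report the same genotype and that they span at
    least `min_technologies` distinct technologies.
    """
    usable = [c for c in calls if c.in_callable_region]
    genotypes = {c.genotype for c in usable}
    if len(genotypes) != 1:
        return None  # no callable data, or callable datasets disagree
    technologies = {c.technology for c in usable}
    if len(technologies) < min_technologies:
        return None  # agreement, but from a single technology only
    return genotypes.pop()


if __name__ == "__main__":
    site = [
        DatasetCall("Illumina", "0/1", True),
        DatasetCall("PacBio CCS", "0/1", True),
        DatasetCall("Ion", "1/1", False),  # not callable here, so ignored
    ]
    print(consensus_genotype(site))
```

Two Illumina datasets agreeing with each other would not count as a consensus under this rule, since both come from the same technology.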

Initial evaluation shows that at a majority of FP and FN sites the benchmark is correct and the error is in the tested callset
Column headers: Number Benchmark Correct; Number Benchmark Unsure; Benchmark is not correct; Comparison callset is not correct; Total sites
CCS with DeepVariant GRCh37 FP 3 9 8 20
CCS with DeepVariant GRCh37 FN 17 3 0 20
CCS with DeepVariant GRCh38 FP 6 7 7 20
CCS with DeepVariant GRCh38 FN 20 0 0 20

Initial evaluation summary by tested callset
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure/No | Number Callset Incorrect
CCS with GATK GRCh37 | 32 | 8 | 32
CCS with GATK GRCh38 | 33 | 7 | 32
ONT with Clair GRCh37 | 60 | 4 | 60
CCS with Clair GRCh37 | 26 | 14 | 24
CCS with Clair GRCh38 | 33 | 7 | 36
Illumina with Dragen GRCh37 | 31 | 9 | 28
Illumina with Dragen GRCh38 | 34 | 6 | 34
