The document summarizes efforts to expand the Genome in a Bottle (GIAB) small variant benchmark using long and linked reads. Key points:
1) PacBio CCS and 10X Genomics data were used to add variants to the benchmark, mostly in regions difficult to map with short reads. This expanded coverage of variants and reference bases.
2) An initial evaluation found that, at the majority of curated false positive and false negative sites, the benchmark was correct and the error was in the tested callset rather than the benchmark.
3) Refinements to the benchmark were identified, including excluding certain regions, to improve accuracy for the next version. The expanded benchmark improves evaluation of variant callers in difficult genomic
Using long reads to expand Genome in a Bottle small variant benchmark
1. Using long and linked reads to generate a new Genome in a Bottle small variant benchmark
Justin Wagner, Andrew Carroll, Ian T. Fiddes, Aaron M. Wenger, William J.
Rowell, Nathan Olson, Lindsey Harris, Jenny McDaniel, Xin Zhou, Sergey
Aganezov, Melanie Kirsche, Bohan Ni, Samantha Zarate, Byunggil Yoo, Neil
Miller, C. Xiao, Marc Salit, Justin Zook, Genome in a Bottle Consortium
GRC/GIAB Workshop ASHG 2019
2. Overview
• v3.3.2 benchmark variants and regions cover 87.84% of assembled
bases in chromosomes 1-22 in GRCh37 for the sample HG002
• Short read variant callers perform poorly in genomic locations with
high homology such as segmental duplications and low-complexity
repeat-rich regions
• Now utilizing PacBio CCS and 10X Genomics data to expand the GIAB
benchmark regions and reduce errors in current regions
• Long and linked reads add variants to the benchmark, mostly in
regions difficult to map with short reads
• GRCh37: 276,840 SNPs and 53,482 INDELs
• GRCh38: 286,483 SNPs and 42,980 INDELs
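4. When do we trust variants and regions from each method
[Figure: schematic for a single variant calling method X. Tracks shown: variants (PASS and filtered outliers, with example genotypes 1/1 and 0/1); low/high coverage or low MQ (or low GQ for gVCF); difficult regions/SVs; TR; and the resulting callable regions. Three cases are labeled (1), (2), and (3).]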
5. Arbitrating between variant calls in different methods
[Figure: schematic of two input methods, each with PASS variants (#1, #2) and callable regions (#1, #2), and the resulting benchmark calls and benchmark regions, with example genotypes 0/1 and 1/1. Four cases are labeled: (1) concordant; (2) discordant, unresolved; (3) discordant, arbitrated; (4) concordant, not callable.]
7. Long and linked reads cover more variants and regions
[Figure: schematic for a single variant calling method X. Tracks shown: variants (PASS and filtered outliers, with example genotypes 1/1 and 0/1); low/high coverage or low MQ (or low GQ for gVCF); difficult regions/SVs; TR; and the resulting callable regions. Three cases are labeled (1), (2), and (3).]
10x Genomics and PacBio CCS data add new variants (1), regions with good coverage of high MQ reads (2), and access to difficult regions (3)
9. Difficult Regions Excluded from all Methods
Difficult Region Description | Bases Covered in GRCh37 | Bases Covered in GRCh38
v0.6 SV GIAB Benchmark | 32,596,754 | 32,872,907
Potential copy number variation | 51,713,344 | 62,666,746
Tandem Repeats > 10kb | 5,731,885 | 71,942,255
Highly similar and high depth segmental duplications | 1,232,701 | 2,094,143
Regions that are collapsed and expanded from GRCh37/38 Primary Assembly Alignments | 17,979,597 | N/A
Modeled centromere and heterochromatin | N/A | 62,304,573
10. Difficult Regions Excluded by Method
• Tandem Repeats < 51bp except GATK from Illumina PCR-Free, Complete Genomics, and CCS DeepVariant
• Tandem Repeats > 51bp and < 200bp except GATK from Illumina PCR-Free and CCS DeepVariant
• Tandem Repeats > 200bp except CCS DeepVariant
• Homopolymers > 6bp except GATK from Illumina PCR-Free, Complete Genomics, Ion Exome, and PacBio CCS
• Imperfect homopolymer > 10bp except GATK from Illumina PCR-Free
• Difficult to map regions for short reads except 10x and CCS
• LINE:L1Hs > 500bp except Illumina MatePair, 10x, and CCS
• Segmental duplications except 10x and CCS
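The per-method exclusions above amount to a lookup from region type to the methods that are still trusted there. The sketch below transcribes the slide into a Python dictionary; the key names, the method labels, and the `is_excluded` helper are illustrative assumptions, not identifiers from any GIAB file or pipeline.

```python
# Region type -> methods that are NOT excluded in that region type.
# Keys and method labels are hypothetical names chosen for this sketch;
# the groupings are transcribed from the slide.
EXEMPT_METHODS = {
    "tandem_repeat_lt_51bp": {
        "GATK from Illumina PCR-Free", "Complete Genomics", "CCS DeepVariant"},
    "tandem_repeat_51_to_200bp": {
        "GATK from Illumina PCR-Free", "CCS DeepVariant"},
    "tandem_repeat_gt_200bp": {"CCS DeepVariant"},
    "homopolymer_gt_6bp": {
        "GATK from Illumina PCR-Free", "Complete Genomics", "Ion Exome", "PacBio CCS"},
    "imperfect_homopolymer_gt_10bp": {"GATK from Illumina PCR-Free"},
    "difficult_to_map_short_reads": {"10x", "CCS"},
    "line_l1hs_gt_500bp": {"Illumina MatePair", "10x", "CCS"},
    "segmental_duplication": {"10x", "CCS"},
}


def is_excluded(region_type: str, method: str) -> bool:
    """Return True if calls from `method` are excluded in `region_type`."""
    return method not in EXEMPT_METHODS[region_type]


# Example: short-read GATK calls are excluded in segmental duplications,
# while CCS DeepVariant calls are kept in long tandem repeats.
short_read_in_segdup = is_excluded("segmental_duplication", "GATK from Illumina PCR-Free")
ccs_in_long_tr = is_excluded("tandem_repeat_gt_200bp", "CCS DeepVariant")
```

Keeping the table as data makes it easy to audit against the slide when a method is added or a rule changes.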
11. v4 draft benchmark includes variants found
with haplotype-resolved assembly of MHC
• Worked with a team from the March 2019 NCBI Pangenome
Hackathon to generate haplotype-resolved assembly of MHC region
(chr6:28,477,797-33,448,354 in GRCh37)
• Use assembly to call small variants
• Small variants from assembly are integrated with mapping-based calls
in the MHC region for v4 draft benchmark
• v4 draft benchmark includes 23,229 variants in the MHC region
• Covers most HLA genes and CYP21A2/TNXA/TNXB
12. v4 draft benchmark includes more bases, variants, and segmental duplications
Metric | v4 draft GRCh37 | v4 draft GRCh38
Base pairs | 2,504,027,936 | 2,509,269,277
Reference covered | 93.2% | 91.03%
SNPs | 3,323,773 | 3,314,941
Indels | 519,152 | 519,494
Base pairs in Segmental Duplications | 64,300,499 | 73,819,342
[Figure: bar chart of percent of reference covered; axis from 80.00% to 95.00%.]
13. Some variants and segmental duplications only covered in v3.3.2 or v4 draft
[Figure: comparison of SNPs, INDELs, and segmental duplications found only in v3.3.2 or only in v4 draft, for GRCh37 and GRCh38. The assignment of values to panels is not recoverable from the extracted text. Values as extracted, in order. SNPs and INDELs: 343,358; 69,495; 77,324; 23,828; 376,653; 91,837; 91,719; 48,753. Segmental duplications: 25,445; 63,949,151; 1,928,353; 70,187,985.]
14. v4 draft enables benchmarking in regions
difficult for short reads
Comparison of Illumina RTG VCF against benchmark sets
• SNP FNs increase by a factor of more than 3, mostly due to new
benchmark variants in difficult to map regions and segmental
duplications
• False negatives: variants present in the truth set, but missed in the query
Subset | v3.3.2 FNs | v4 draft FNs
All SNPs | 8,594 | 30,229
Low mappability | 6,708 | 25,295
Segmental duplications | 1,429 | 14,008
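The "factor of more than 3" claim can be checked directly from the table. The short sketch below computes the fold change in false negatives per subset; the counts are taken from the slide, while the variable and function names are illustrative.

```python
# FN counts from the slide: subset -> (v3.3.2 FNs, v4 draft FNs)
fn_counts = {
    "All SNPs": (8594, 30229),
    "Low mappability": (6708, 25295),
    "Segmental duplications": (1429, 14008),
}


def fold_change(before: int, after: int) -> float:
    """Ratio of FN counts against the v4 draft versus v3.3.2."""
    return after / before


fold_changes = {subset: fold_change(*counts) for subset, counts in fn_counts.items()}

for subset, fc in fold_changes.items():
    print(f"{subset}: {fc:.2f}x")
```

On these numbers the increase is roughly 3.5-fold for all SNPs and close to 10-fold within segmental duplications, consistent with most new FNs falling in regions that are hard to map with short reads.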
15. v4 draft benchmark contains more medically-relevant variants
• v4 draft covers more of the MHC region
• Outside of the MHC updates, the top 5 genes by increase in variants from v3.3.2 to the v4 draft benchmark: TSPEAR (31), LAMA5 (28), FCGBP (18), TPSAB1 (15), HSPG2 (13)
• Among the ACMG59 genes, PMS2 has 2 more variants, and RET, SCN5A, and TNNI3 each have 1 more variant, covered in the v4 draft benchmark than in v3.3.2
Variants in Medical Exome (genes from OMIM, HGMD, ClinVar, UniProt)
Benchmark Regions v3.3.2: 8,209
Benchmark Regions v4 draft: 9,527
16. Sanger sequencing confirms medically-relevant variants
• Performed long range PCR before sequencing
• Confirmed 12 variants in CYP21A2, which is a medically-relevant gene in the MHC region
• Confirmed 6 variants in PMS2
• Confirmed 15 variants in 5 other genes
17. Evaluation by GIAB collaborators
Compared benchmark to callsets from a variety of technologies and
variant calling methods including:
• Illumina PCR-Free and Dragen
• PacBio CCS and GATK4
• PacBio CCS and DeepVariant
• PacBio CCS and Clair (Next generation of Clairvoyante)
• ONT Promethion and Clair
Preliminary results suggest that, for a majority of FPs and FNs, the benchmark is correct and the errors are in the tested callsets
More volunteers welcomed
18. Manual curation by callset developers
Process
• Compare callset to benchmark using
hap.py and/or vcfeval
• Randomly select 5 FP SNPs, 5 FN SNPs, 5
FP indels and 5 FN indels, each from
inside and outside the v3.3.2 benchmark
bed, in GRCh37 and GRCh38
(5*4*2*2=80 total)
• Use IGV with PCR-free Illumina, PacBio
CCS, 10x, and ONT + difficult bed files
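The sampling scheme in the process above (5 sites for each combination of FP/FN, SNP/indel, inside/outside the v3.3.2 bed, and GRCh37/GRCh38) can be sketched as stratified random selection. This is a minimal illustration, not the consortium's script: the stratum labels, the `select_sites` helper, and the toy candidate lists are assumptions, and in practice the candidates would come from hap.py or vcfeval output.

```python
import random
from itertools import product

ERROR_TYPES = ["FP", "FN"]
VARIANT_TYPES = ["SNP", "indel"]
BED_STATUS = ["inside_v3.3.2_bed", "outside_v3.3.2_bed"]
REFERENCES = ["GRCh37", "GRCh38"]
SITES_PER_STRATUM = 5


def select_sites(candidates, seed=0):
    """Randomly pick up to SITES_PER_STRATUM sites from every stratum.

    candidates: dict mapping (error_type, variant_type, bed_status, reference)
    to a list of site identifiers (hypothetical format).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = {}
    for stratum in product(ERROR_TYPES, VARIANT_TYPES, BED_STATUS, REFERENCES):
        pool = candidates.get(stratum, [])
        k = min(SITES_PER_STRATUM, len(pool))
        selected[stratum] = rng.sample(pool, k)
    return selected


# Toy input: 50 made-up site identifiers per stratum.
toy_candidates = {
    stratum: [f"site_{i}" for i in range(50)]
    for stratum in product(ERROR_TYPES, VARIANT_TYPES, BED_STATUS, REFERENCES)
}
selected = select_sites(toy_candidates)
total_selected = sum(len(sites) for sites in selected.values())
```

With at least 5 candidates in every stratum, the 16 strata yield the 80 sites stated on the slide.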
Questions to ask
• Are both alleles correct in the
benchmark?
• Yes/No/Unsure
• Are both alleles correct in the callset
being tested?
• Yes/No/Unsure
• If the benchmark is wrong or
questionable, how did you make this
determination?
• Instructions: Be critical of the benchmark,
and select unsure if the evidence does
not strongly support the benchmark
being correct
19. Process for independent evaluations
[Flowchart]
• Callset developer curates putative errors
• If the benchmark is correct: no further curation
• If the benchmark is wrong or questionable, a NIST curator reviews the site
• NIST curator disagrees: discuss with callset developer
• NIST curator agrees: classify source of potential error in benchmark
20. Initial evaluation suggests that, for a majority of FPs and FNs, the benchmark is correct and the errors are in the tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites
CCS with GATK GRCh37 FP | 19 | 1 | 0 | 19 | 20
CCS with GATK GRCh37 FN | 15 | 3 | 2 | 18 | 20
ONT with Clair GRCh37 FP | 33 | 1 | 0 | 34 | 34
ONT with Clair GRCh37 FN | 27 | 3 | 0 | 30 | 30
CCS with Clair GRCh37 FP | 7 | 13 | 0 | 6 | 20
CCS with Clair GRCh37 FN | 19 | 1 | 0 | 19 | 20
Illumina with Dragen GRCh37 FP | 14 | 6 | 0 | 11 | 20
Illumina with Dragen GRCh37 FN | 17 | 3 | 0 | 17 | 20
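To make the "majority" statement concrete, the GRCh37 curation counts above can be summed across callsets. The sketch below uses the numbers from this slide's table; the tuple layout and variable names are illustrative.

```python
# (callset, error type, benchmark correct, benchmark unsure,
#  benchmark not correct, total sites) from the GRCh37 table
rows = [
    ("CCS with GATK", "FP", 19, 1, 0, 20),
    ("CCS with GATK", "FN", 15, 3, 2, 20),
    ("ONT with Clair", "FP", 33, 1, 0, 34),
    ("ONT with Clair", "FN", 27, 3, 0, 30),
    ("CCS with Clair", "FP", 7, 13, 0, 20),
    ("CCS with Clair", "FN", 19, 1, 0, 20),
    ("Illumina with Dragen", "FP", 14, 6, 0, 20),
    ("Illumina with Dragen", "FN", 17, 3, 0, 20),
]

# Sanity check: the three benchmark categories partition each row's sites.
rows_consistent = all(
    correct + unsure + wrong == total
    for _, _, correct, unsure, wrong, total in rows
)

correct_total = sum(row[2] for row in rows)
site_total = sum(row[5] for row in rows)
fraction_correct = correct_total / site_total

print(f"Benchmark judged correct at {correct_total}/{site_total} "
      f"curated sites ({fraction_correct:.1%})")
```

The one row where the benchmark was not judged correct at a majority of sites is CCS with Clair FPs (7 of 20), where most sites were marked unsure rather than wrong.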
25. Potential refinements identified for v4.1
• Exclude VDJ
• Exclude Inversions
• Improve CNV coverage
• Use ONT for excessive coverage
• Explore smoothing on excessive coverage beds
• Use new diploid assemblies to identify CNVs
• MHC
• Exclude CNVs in the MHC, partial repeats in MHC, small regions that are questionable in the
DRB genes
• Benchmark regions density
• Regions with dense variation and many gaps in bed
• Dense variants near SVs
• Segmental duplications
• Small region of duplication covered by benchmark
• Containing an SV
26. Conclusions
• Long and linked reads add variants to the benchmark, mostly in
regions difficult to map with short reads
• GRCh37: 276,840 SNPs and 53,482 INDELs
• GRCh38: 286,483 SNPs and 42,980 INDELs
• v4 draft benchmark is available for GRCh37 and GRCh38
• GRCh37 Percent Chromosomes 1-22 Covered: 93.2%
• GRCh38 Percent Chromosomes 1-22 Covered: 91.03%
• Initial evaluation suggests that, for a majority of FPs and FNs, the benchmark is correct and the errors are in the tested callsets
• More volunteers welcomed
• Identified refinements for v4.1
27. On-going and Future Work
• Refine use of genome stratifications
• Adding variant calls from raw PacBio and Oxford Nanopore
• Improve benchmark for larger indels, homopolymers, and tandem
repeats
• Improve normalization of complex variants
• Generating benchmark variants from diploid assemblies
• Machine learning
• Outlier detection, active learning
• Generate v4 draft for other GIAB genomes
28. Acknowledgements
• Andrew Carroll
• Ian T. Fiddes
• Aaron M. Wenger
• William J. Rowell
• Nathan Olson
• Lindsey Harris
• Jenny McDaniel
• Chunlin Xiao
• Marc Salit
• Justin Zook
• Genome in a Bottle Consortium
Draft Benchmark Evaluators
• Xin Zhou
• Sergey Aganezov
• Melanie Kirsche
• Bohan Ni
• Samantha Zarate
• Byunggil Yoo
• Neil Miller
30. Initial evaluation suggests that, for a majority of FPs and FNs, the benchmark is correct and the errors are in the tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites
CCS with GATK GRCh38 FP | 16 | 4 | 0 | 16 | 20
CCS with GATK GRCh38 FN | 17 | 3 | 0 | 16 | 20
ONT with Clair GRCh38 FP | 19 | 1 | 0 | 19 | 20
ONT with Clair GRCh38 FN | 14 | 6 | 0 | 19 | 20
CCS with Clair GRCh38 FP | 15 | 5 | 0 | 16 | 20
CCS with Clair GRCh38 FN | 18 | 2 | 0 | 20 | 20
Illumina with Dragen GRCh38 FP | 16 | 3 | 1 | 16 | 20
Illumina with Dragen GRCh38 FN | 18 | 2 | 0 | 18 | 20
31. Integration Pipeline Process
(1) Find sensitive variant calls and callable regions for each dataset, excluding difficult regions/SVs that are problematic for each type of data and variant caller
(2) Find “consensus” calls with support from 2+ technologies (and no other technologies disagree) using callable regions
(3) Use “consensus” calls to train simple one-class model for each dataset and find “outliers” that are less trustworthy for each dataset
(4) Find benchmark calls by using callable regions and “outliers” to arbitrate between datasets when they disagree
(5) Find benchmark regions by taking union of callable regions and subtracting uncertain variants
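The consensus and arbitration steps of this pipeline can be illustrated for a single site. The sketch below is a simplified reading of the slide, not the GIAB integration code: the data layout (technology mapped to a genotype string and a callable flag), the function names, and the treatment of outliers as a per-site set of technologies are all assumptions made for illustration.

```python
def consensus_genotype(site_calls):
    """Return the genotype supported by 2+ callable technologies with no
    callable technology disagreeing, or None if there is no consensus.

    site_calls: dict mapping technology name -> (genotype or None, callable flag).
    """
    callable_gts = [gt for gt, is_callable in site_calls.values()
                    if is_callable and gt is not None]
    if len(callable_gts) >= 2 and len(set(callable_gts)) == 1:
        return callable_gts[0]
    return None


def arbitrate(site_calls, outlier_techs):
    """Return the benchmark genotype after dropping technologies that are not
    callable at the site or whose call was flagged as an outlier.

    Returns None when the remaining technologies still disagree (or none
    remain), i.e. the site stays uncertain and would be subtracted from the
    benchmark regions.
    """
    trusted = {gt for tech, (gt, is_callable) in site_calls.items()
               if is_callable and gt is not None and tech not in outlier_techs}
    return trusted.pop() if len(trusted) == 1 else None


# Two callable technologies agree; a third disagrees but is not callable here.
agreeing_site = {
    "Illumina": ("0/1", True),
    "CCS": ("0/1", True),
    "10x": ("1/1", False),
}
# Two callable technologies disagree; arbitration needs outlier information.
discordant_site = {"Illumina": ("0/1", True), "CCS": ("1/1", True)}

consensus = consensus_genotype(agreeing_site)
unresolved = arbitrate(discordant_site, outlier_techs=set())
arbitrated = arbitrate(discordant_site, outlier_techs={"Illumina"})
```

In the pipeline described on the slide, the outlier flags come from a one-class model trained on the consensus calls; that model is outside the scope of this sketch.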
33. Initial evaluation shows that, for a majority of FPs and FNs, the benchmark is correct and the errors are in the tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites
CCS with DeepVariant GRCh37 FP | 3 | 9 | 8 | (not given) | 20
CCS with DeepVariant GRCh37 FN | 17 | 3 | 0 | (not given) | 20
CCS with GATK GRCh37 FP | 19 | 1 | 0 | 19 | 20
CCS with GATK GRCh37 FN | 15 | 3 | 2 | 18 | 20
ONT with Clair GRCh37 FP | 33 | 1 | 0 | 34 | 34
ONT with Clair GRCh37 FN | 27 | 3 | 0 | 30 | 30
CCS with Clair GRCh37 FP | 7 | 13 | 0 | 6 | 20
CCS with Clair GRCh37 FN | 19 | 1 | 0 | 19 | 20
Illumina with Dragen GRCh37 FP | 14 | 6 | 0 | 11 | 20
Illumina with Dragen GRCh37 FN | 17 | 3 | 0 | 17 | 20
34. Initial evaluation shows that, for a majority of FPs and FNs, the benchmark is correct and the errors are in the tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure | Benchmark is not correct | Comparison callset is not correct | Total sites
CCS with DeepVariant GRCh38 FP | 6 | 7 | 7 | (not given) | 20
CCS with DeepVariant GRCh38 FN | 20 | 0 | 0 | (not given) | 20
CCS with GATK GRCh38 FP | 16 | 4 | 0 | 16 | 20
CCS with GATK GRCh38 FN | 17 | 3 | 0 | 16 | 20
ONT with Clair GRCh38 FP | 19 | 1 | 0 | 19 | 20
ONT with Clair GRCh38 FN | 14 | 6 | 0 | 19 | 20
CCS with Clair GRCh38 FP | 15 | 5 | 0 | 16 | 20
CCS with Clair GRCh38 FN | 18 | 2 | 0 | 20 | 20
Illumina with Dragen GRCh38 FP | 16 | 3 | 1 | 16 | 20
Illumina with Dragen GRCh38 FN | 18 | 2 | 0 | 18 | 20
35. Initial evaluation shows that, for a majority of FPs and FNs, the benchmark is correct and the errors are in the tested callsets
Platform and Caller | Number Benchmark Correct | Number Benchmark Unsure/No | Number Callset Incorrect
CCS with GATK GRCh37 | 32 | 8 | 32
CCS with GATK GRCh38 | 33 | 7 | 32
ONT with Clair GRCh37 | 60 | 4 | 60
CCS with Clair GRCh37 | 26 | 14 | 24
CCS with Clair GRCh38 | 33 | 7 | 36
Illumina with Dragen GRCh37 | 31 | 9 | 28
Illumina with Dragen GRCh38 | 34 | 6 | 34
Editor's notes
Exclude tandem repeats approximately larger than the read length for each method
Homopolymers are excluded from 10x and PacBio CCS
Very long homopolymers are only included for GATK-based calls from PCR-Free data, because the GATK gVCF reports a low genotype quality score when no reads fully span the homopolymer
- Trust homopolymers most from PCR-Free short reads
Ongoing work includes checking if many are in regions that might be in potential CNVs as they could be errors in v3.3.2
False negatives (FN): variants present in the truth set, but missed in the query.
3_79181930
Add this from what Lindsey sent on Slack
Combine GRCh37 and GRCh38
Left is an inversion
Right is likely a LINE-mediated inversion
- If there is an inversion near repetitive elements, then exclude the repetitive elements as well
- Show just two LINEs and the inversion they flank
Left is likely a tandem duplication or large insertion or complex insertion
Right is an inversion followed by a deletion that is in the SV benchmark, likely a complex SV
Update this table: includes Billy’s new results
10x-Aquila_37 | 16 | 24 | 16
10x-Aquila_38 | 22 | 18 | 17