2. AGENDA
-PacBio Sequencing Modes: Long reads (CLR) vs HiFi
-HiFi datasets available through GIAB
-Detecting variants in HiFi reads with GATK HaplotypeCaller
-Evaluation of v4 draft benchmark
3. TWO MODES OF PACBIO SMRT SEQUENCING
Continuous Long Read Sequencing (CLR)
-Output: Long Read 1 ... Long Read n
-Long reads >20 kb, 90% accuracy
4. TWO MODES OF PACBIO SMRT SEQUENCING
Continuous Long Read Sequencing (CLR)
-Output: Long Read 1 ... Long Read n
-Long reads >20 kb, 90% accuracy
Circular Consensus Sequencing (CCS)
-Subread 1 ... Subread n are combined into a consensus sequence, the HiFi read
-HiFi reads ≤20 kb, >99% accuracy
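As a rough intuition for why combining subreads raises accuracy, the sketch below takes a per-position majority vote over subreads that are assumed to be already aligned and of equal length. It is a toy illustration only, not the CCS algorithm; the function name and the example sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(subreads):
    """Per-column majority vote over pre-aligned, equal-length subreads (toy model)."""
    if not subreads:
        raise ValueError("need at least one subread")
    length = len(subreads[0])
    if any(len(s) != length for s in subreads):
        raise ValueError("toy example assumes pre-aligned, equal-length subreads")
    consensus = []
    for column in zip(*subreads):
        # Pick the most frequent base in this column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Hypothetical subreads of one molecule; two of them carry a single error.
subreads = [
    "ACGTTGCA",
    "ACGATGCA",  # substitution error at index 3
    "ACGTTGCA",
    "ACCTTGCA",  # substitution error at index 2
    "ACGTTGCA",
]
print(majority_consensus(subreads))
```

Because the errors fall at different positions in different subreads, each is outvoted by the other passes. Real subreads also contain insertions and deletions, which a column-wise vote like this one does not handle.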
5. HIFI READS MAP THROUGH DIFFICULT REGIONS
Figure: alignments of short reads and PacBio HiFi reads across STRC.
STRC is a congenital deafness gene that requires long reads to cover all exons.
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
6. PACBIO HIFI DATASETS FOR GIAB SAMPLES
Each dataset sequenced to approximately 30-fold coverage
Sample  Insert length  Platform          Reads (SRA)             Alignments
HG002   10 kb          Sequel System     https://bit.ly/2OCLeA2  https://bit.ly/2OCLeA2
HG002   15 kb          Sequel System     PRJNA520771             https://bit.ly/2p1ISA8
HG002   11 kb          Sequel II System  PRJNA527278             https://bit.ly/2VqdJm1
HG001   11 kb          Sequel II System  PRJNA540705             https://bit.ly/2AWtVSM
HG005   11 kb          Sequel II System  PRJNA540706             https://bit.ly/2ogGbuI
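To put "approximately 30-fold coverage" in terms of read counts, a back-of-the-envelope calculation can help. The sketch below assumes a human genome size of about 3.1 Gb and treats the insert length as the mean read length; both are simplifying assumptions, not figures from the datasets.

```python
def reads_for_coverage(coverage, genome_size_bp, mean_read_length_bp):
    """Number of reads needed so that total bases = coverage x genome size."""
    total_bases = coverage * genome_size_bp
    return total_bases / mean_read_length_bp


GENOME_SIZE_BP = 3.1e9  # assumption: approximate human genome size

for insert_kb in (10, 11, 15):
    n_reads = reads_for_coverage(30, GENOME_SIZE_BP, insert_kb * 1000)
    print(f"{insert_kb} kb inserts: ~{n_reads / 1e6:.1f} million reads")
```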
7. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
Workflow: HiFi reads → mapping with pbmm2 (SMRT Link) → HaplotypeCaller → VariantFiltration (GATK4) → variant calls (vcf)
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37, 1155–1162 (2019).
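The workflow above can be sketched as three command lines. The snippet below only builds and prints them; it does not run anything. The file names, the `build_pipeline` helper, and the `QD < 2.0` filter are illustrative assumptions, not the settings used for the results in this talk.

```python
import shlex


def build_pipeline(reference, hifi_reads, prefix):
    """Return the three command lines of a pbmm2 + GATK4 sketch (hypothetical helper)."""
    aligned = f"{prefix}.aligned.bam"
    raw_vcf = f"{prefix}.raw.vcf.gz"
    filtered_vcf = f"{prefix}.filtered.vcf.gz"
    return [
        # Map HiFi reads to the reference and sort the output.
        ["pbmm2", "align", reference, hifi_reads, aligned,
         "--preset", "CCS", "--sort"],
        # Call variants from the aligned reads.
        ["gatk", "HaplotypeCaller",
         "-R", reference, "-I", aligned, "-O", raw_vcf],
        # Annotate low-confidence calls; the threshold is an assumed example value.
        ["gatk", "VariantFiltration",
         "-R", reference, "-V", raw_vcf, "-O", filtered_vcf,
         "--filter-name", "LowQD", "--filter-expression", "QD < 2.0"],
    ]


commands = build_pipeline("ref.fa", "hifi.bam", "HG002")
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments makes it straightforward to pass to `subprocess.run` later without shell quoting problems.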
8. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
-High SNP Recall and Precision
-Lower Indel Recall and Precision, due to 1 bp indel errors
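For reference, recall and precision are computed from true positive (TP), false positive (FP), and false negative (FN) counts obtained by comparing a callset against a benchmark. A minimal sketch follows; the counts in the example are made up for illustration and are not results from this evaluation.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f}")
```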
9. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
-High SNP Recall and Precision
-Lower Indel Recall and Precision, due to 1 bp indel errors
-HaplotypeCaller optimized for error mode of short reads
Figure: read error types (Indel vs. Mismatch); labeled values 96.6% for PacBio HiFi and 99.1% for short reads.
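One way to look at the error mode of a read set is to tally mismatches and indels from alignment CIGAR strings. The sketch below assumes the aligner wrote extended CIGARs that use `=` and `X` for matches and mismatches; with plain `M` operations, mismatches cannot be recovered from the CIGAR alone. The function name and example string are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def error_profile(cigar):
    """Count mismatched bases and indel events in an extended CIGAR string."""
    counts = {"mismatch": 0, "indel": 0}
    for length, op in CIGAR_OP.findall(cigar):
        if op == "X":
            counts["mismatch"] += int(length)  # one count per mismatched base
        elif op in ("I", "D"):
            counts["indel"] += 1  # one count per insertion or deletion event
    return counts


# Hypothetical alignment: one mismatch, one 1 bp insertion, one 2 bp deletion.
print(error_profile("50=1X20=1I30=2D10="))
```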
10. DETECTING VARIANTS IN HIFI READS WITH GATK HAPLOTYPECALLER
-High SNP Recall and Precision
-Lower Indel Recall and Precision, due to 1 bp indel errors
-HaplotypeCaller optimized for error mode of short reads
-We recommend using a caller that can adapt to the error mode of long reads, such as DeepVariant (see Pi-Chuan Chang’s lightning talk)
12. MANUAL CURATION OF FP AND FN
General themes:
-GATK misses or makes incorrect indel calls in homopolymer stretches
-GATK false positives due to mis-mapped LINE elements and segmental duplications
-GATK false negatives due to low coverage depth or mapping quality
Figure: Manually Curated Discordant Variants (stacked bars, 0% to 100%; categories: Benchmark Correct, GATK Callset Correct, Unsure)
-Putative FN: bar labels 15, 2, 3
-Putative FP: bar labels 19, 1
Opportunities to improve variant calling:
-incorrect indel calls in homopolymer stretches (FP + FN)
-mis-mapped LINE elements and segmental duplications (FP)
-low mapping quality (FN)
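When curating discordant indel calls, it is useful to flag those that fall inside a homopolymer stretch. A minimal sketch, assuming 0-based positions in a reference sequence string and an arbitrary minimum run length of 5 (both the helper names and the threshold are hypothetical):

```python
def homopolymer_run_length(sequence, position):
    """Length of the run of identical bases that contains sequence[position]."""
    base = sequence[position]
    start = position
    while start > 0 and sequence[start - 1] == base:
        start -= 1
    end = position
    while end + 1 < len(sequence) and sequence[end + 1] == base:
        end += 1
    return end - start + 1


def in_homopolymer(sequence, position, min_run=5):
    """True if the position lies in a run of at least min_run identical bases."""
    return homopolymer_run_length(sequence, position) >= min_run


reference = "ACGTTTTTTTGCA"  # hypothetical reference with a run of seven T bases
print(homopolymer_run_length(reference, 5), in_homopolymer(reference, 5))
print(homopolymer_run_length(reference, 1), in_homopolymer(reference, 1))
```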
13. FN IN CALLSET - UNSURE ABOUT BENCHMARK
Benchmark - homozygous T➔A
Illumina: A/A
PacBio HiFi: A/TA
ONT: T/A
10X: A/TA
14. FP IN CALLSET - UNSURE ABOUT BENCHMARK
Benchmark - no call
Illumina: no coverage
PacBio HiFi: C/T
ONT: C/T
10X: C/T (odd allele frequency)
15. FP IN CALLSET - UNSURE ABOUT BENCHMARK (CONT’D)
Alignment tracks: Illumina, PacBio HiFi, ONT, 10X
16. FP + FN IN CALLSET - BENCHMARK INCORRECT FOR STR CONTRACTION
Benchmark - GGAG⨯9 deletion
Illumina: low coverage
PacBio HiFi: GGAG⨯2 deletion
ONT: ~GGAG⨯2 deletion
10X: GGAG⨯2 deletion
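The size of an STR contraction can be expressed as the difference in tandem copy number between the reference and a read. The sketch below counts back-to-back copies of a motif; the sequences are invented for illustration (they are not the actual locus), and the function name is hypothetical.

```python
def count_tandem_copies(sequence, motif):
    """Largest number of back-to-back copies of motif found anywhere in sequence."""
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        copies = 0
        pos = start
        while sequence.startswith(motif, pos):
            copies += 1
            pos += step
        best = max(best, copies)
    return best


# Hypothetical sequences: the read carries two fewer GGAG units than the reference.
reference = "TTCA" + "GGAG" * 12 + "CTTA"
read = "TTCA" + "GGAG" * 10 + "CTTA"

ref_copies = count_tandem_copies(reference, "GGAG")
read_copies = count_tandem_copies(read, "GGAG")
print(f"reference: {ref_copies} copies, read: {read_copies} copies, "
      f"contraction: {ref_copies - read_copies} units")
```

A read that spans the whole repeat with unique flanking sequence on both sides gives a direct copy count, which is why long reads are informative at loci like this one.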
17. CONCLUSIONS
-v4 draft benchmark satisfies GIAB goal for GATK calls on HiFi reads:
  -75% of putative FN and 95% of putative FP are clearly errors in the GATK callset
-Suggestions for improving the benchmark:
  -Exclude regions with SNV disagreements between long/linked read datasets or odd SNV frequencies (2:1, 3:1) in long/linked read datasets
  -Require support from long reads for indels in repetitive regions with low short read coverage
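The "odd SNV frequencies" suggestion could be applied as a simple allele-balance check at heterozygous sites. The sketch below flags sites whose alternate-allele fraction falls outside an assumed window of 0.4 to 0.6; the window and the function name are illustrative assumptions, not part of the benchmark definition.

```python
def has_odd_allele_balance(ref_count, alt_count, low=0.4, high=0.6):
    """True if the alt-allele fraction at a het site is outside [low, high]."""
    total = ref_count + alt_count
    if total == 0:
        return False  # no coverage, nothing to judge
    alt_fraction = alt_count / total
    return not (low <= alt_fraction <= high)


# Hypothetical read counts (ref, alt) at heterozygous SNV sites.
for ref_count, alt_count in [(15, 15), (20, 10), (10, 30)]:
    flagged = has_odd_allele_balance(ref_count, alt_count)
    print(f"ref={ref_count} alt={alt_count} odd_balance={flagged}")
```

A fixed window ignores sampling noise at low depth, so a binomial test on the read counts would be a more careful alternative.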