140127 rtg phased pedigree analyses
1. Development & applications of a
segregation-phasing ground truth
GENOME-IN-A-BOTTLE WORKSHOP
Francisco M. De La Vega, D.Sc.
Visiting Scholar, Department of Genetics
Stanford University School of Medicine
In collaboration with Real Time Genomics, Inc.
2. Evaluating Variant Calls
O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and
genome sequencing. Genome Medicine 5, 28 (2013).
3. Beyond Venn Diagrams
• Experimental validation (e.g. Sanger, qPCR)
  • Expensive
  • Limited by platform success
  • Statistical sample
• Reference orthogonal data available for some genomes
  • SNP array data
  • Sparse fosmid sequencing data
  • Incomplete
• Reference genomes sequenced by multiple platforms
  • Arbitration methods (e.g. NIST, Genome-in-a-Bottle)
  • Low FP, but unknown FN (genome-wide)
  • Biases?
8. Steps for haplotype phasing in a large family
Identify crossovers
Phase contiguity extension
Connect haplotype islands
Check calls vs haplotype framework
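The first and last steps above can be illustrated with a minimal Python sketch. It is not the RTG implementation: the function names, the 0/1 encoding of inherited parental haplotypes, and the brute-force search over parental phasings are all assumptions made for illustration, and it handles a single nuclear family with diploid genotypes only.

```python
from itertools import product


def find_crossovers(inheritance):
    """Return (site_index, child_index) pairs where a child switches
    between one parent's two haplotypes at consecutive sites.

    inheritance[s][c] is 0 or 1: which of that parent's haplotypes
    child c carries at site s (hypothetical encoding).
    """
    events = []
    for s in range(1, len(inheritance)):
        pairs = zip(inheritance[s - 1], inheritance[s])
        for c, (prev, cur) in enumerate(pairs):
            if prev != cur:
                events.append((s, c))
    return events


def is_phase_consistent(father, mother, children, pattern):
    """Check the calls at one site against the haplotype framework.

    father, mother: unphased genotypes as allele tuples, e.g. (0, 1).
    children: list of unphased child genotypes (allele tuples).
    pattern: one (f, m) pair per child, naming the paternal and maternal
             haplotype (0 or 1) that the child inherited at this site.
    The calls are consistent if some phasing of the two parents
    reproduces every child's genotype.
    """
    father_phasings = {father, father[::-1]}
    mother_phasings = {mother, mother[::-1]}
    for f_hap, m_hap in product(father_phasings, mother_phasings):
        if all(sorted((f_hap[f], m_hap[m])) == sorted(child)
               for child, (f, m) in zip(children, pattern)):
            return True
    return False


# Child 1 switches haplotype at site 1, child 2 at site 2.
print(find_crossovers([[0, 0, 1], [0, 1, 1], [0, 1, 0]]))  # [(1, 1), (2, 2)]

# Two children inherit different paternal haplotypes at this site.
pattern = [(0, 0), (1, 0)]
print(is_phase_consistent((0, 1), (0, 0), [(0, 0), (0, 1)], pattern))  # True
print(is_phase_consistent((0, 1), (0, 0), [(0, 0), (0, 0)], pattern))  # False
```

The last call shows a site that passes a plain Mendelian check (each child genotype is possible given the parents) but contradicts the inheritance pattern, which is the kind of error that only the haplotype framework exposes.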
14. Probability of a set of genotypes being phase-consistent by chance
Given that there are d different genotypes across both the parents and children, and that genotype i occurs n_i times, the probability is:
[Equation not preserved in this transcript]
Cleary, J. G., et al. Joint variant and de novo mutation identification on pedigrees from high-throughput
sequencing data. bioRxiv (2014). doi:10.1101/001958
15. Probability of a set of genotypes being phase-consistent by chance: some examples
Genotype counts (columns 0/0, 0/1, 1/1, 0/2, 1/2; only non-empty cells survive, listed in reading order) and probability:
13: 1
13: 3.01 × 10^-1
6, 7: 1.01 × 10^-2
1, 12: 1.11 × 10^-1
1, 11, 1: 1.36 × 10^-2
4, 4, 5: 5.53 × 10^-4
3, 3, 3, 4: 6.13 × 10^-5
1, 3, 3, 12: 3.68 × 10^-1
1, 5, 6, 1: 2.75 × 10^-4
1, 11, 13, 1: 7.46 × 10^-2
16. Phasing consistent variants
Illumina 2x100 bp 50X WGS Data, RTG Trio Calls
                              Raw call set            AVR > 0.15
                              n            %          n            %
Phase consistent              5,224,138    77.35      4,606,574    99.28
Phase inconsistent            1,329,189    19.68      13,951       0.30
Repaired                      200,450      2.96       19,197       0.41
Calls inside phased segments  6,753,777    99.99      4,639,722    99.99
Y-chromosome excluded
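As a quick arithmetic check on the raw call set figures for this slide, the sketch below recomputes each percentage from the counts. It assumes the percentages were truncated rather than rounded to two decimals, which is what the numbers suggest (200,450 / 6,753,777 is about 2.968%, shown as 2.96); the helper name `percent_truncated` is hypothetical.

```python
def percent_truncated(n, total):
    """Percentage of n in total, truncated (not rounded) to two decimals."""
    return (n * 10000 // total) / 100


# Raw call set counts from the RTG trio calls on this slide.
raw = {
    "Phase consistent": 5224138,
    "Phase inconsistent": 1329189,
    "Repaired": 200450,
}
total = sum(raw.values())  # 6,753,777 calls inside phased segments

for label, n in raw.items():
    print(f"{label}: {n:,} ({percent_truncated(n, total):.2f}%)")
```

The three categories sum exactly to the "calls inside phased segments" total, and the truncated percentages add up to the 99.99 reported in the last row.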
17. Phasing consistent variants
Illumina 2x100 bp 50X WGS Data, BWA/GATK UG v1.7 Calls
                              Raw call set            VQSR 1st tranche
                              n            %          n            %
Phase consistent              6,941,213    68.34      5,863,035    96.00
Phase inconsistent            2,263,975    22.29      184,169      3.01
Repaired                      951,682      9.36       59,592       0.97
Calls inside phased segments  10,156,870   99.53      6,106,796    99.98
Y-chromosome excluded
21. Assessment of MNP & indel calling (rtgVariant 1.0)
• In rtgVariant 1.0, longer insertions have more FP than short insertions and deletions
• More FP in MNPs
• Improvements in the aligner for v1.2
[Figure: percentage of phase-inconsistent calls for deletions, insertions, and SNV/MNPs (0.5% marked); rtgVariant v1.0, NA12878]
22. Summary & Perspectives
• Genetic segregation in a large family offers a unique
opportunity to identify “true” sets of variants
• Requires collecting data for the whole family as new
chemistries and platforms become available (e.g.
2x250bp, Moleculo reads)
• Data from multiple platforms can be merged to create
a comprehensive phase-consistent ground truth
• Allows rational assessment of variant pipelines and
improvement of algorithms
• Some issues that need to be dealt with: cell line
artifacts, CNVs, systematic errors, SVs.
23. rtgTools v1.0
A toolkit to compare and analyze VCFs
• vcfeval – comparison of VCFs for ROC curves
• rocplot – draw ROC curves from vcfeval output
• mendelian – counts of Mendelian inheritance errors in pedigrees
• vcfstats – basic statistics of VCF files
• vcffilter – filtering of VCFs by scores, etc.
• vcfannotate – annotation of VCF files
• vcfmerge – merge VCF files
Compiled Java code is freely available at the GiaB repository:
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/
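To make concrete what counting Mendelian inheritance errors involves, here is a minimal Python sketch for father/mother/child trios. It is an illustration only, not the rtgTools code: the function names are hypothetical, and it assumes diploid VCF-style genotype strings with no missing calls.

```python
def parse_gt(gt):
    """Split a VCF-style genotype such as '0/1' or '1|0' into allele strings."""
    return gt.replace("|", "/").split("/")


def is_mendelian(father, mother, child):
    """True if the child can take one allele from each parent."""
    a, b = parse_gt(child)
    f, m = parse_gt(father), parse_gt(mother)
    return (a in f and b in m) or (b in f and a in m)


def count_mendelian_errors(trios):
    """Count (father, mother, child) genotype triples that violate inheritance."""
    return sum(not is_mendelian(f, m, c) for f, m, c in trios)


sites = [
    ("0/0", "0/1", "0/1"),  # consistent
    ("0/0", "0/0", "0/1"),  # error: allele 1 is absent from both parents
    ("0/1", "0/1", "1/1"),  # consistent
    ("1/1", "0/0", "1/1"),  # error: mother cannot transmit allele 1
]
print(count_mendelian_errors(sites))  # 2
```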