2. 2
Platinum Genome project: Improving technology & tools
Create a catalogue of highly accurate whole-genome variant calls within a well
characterized pedigree
– SNPs, indels & CNVs
– Including highly confident reference positions
– Provide direct supporting evidence for every variant call
Develop a framework to assess variant callers
Provide a path to improve variant callers by providing better truth data to
accurately assess sensitivity and precision
– Modifying the SNP filters to maximize accuracy
[Figure: comparison of truth and test call sets, with regions labelled Correct, FP and FN]
3. 3
NIST GIAB – Pedigree analysis
12889 12890 12891 12892
12877 12878
12879 12880 12881 12882 12883 12884 12885 12887 12886 12888 12893
All 17 members sequenced to at least 50x depth (PCR-Free protocol)
Variants are called across the pedigree using different software & technology
Inheritance information provides high-confidence, direct validation of variant calls
Analysis of SNPs in the parents and 11 children
4. 4
Pedigree Analysis – Using haplotypes to detect conflicts
[Figure: phased six-site haplotypes for the parents and six children. The four parental haplotypes read ACAGTA, ATCTGA, GTCGTC and GCATTA; one copy of the first reads ACATTA.]
With a sufficiently large pedigree all four possible inheritance patterns will be observed and most of the genotypes can be phased into haplotypes
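How likely it is that all four inheritance patterns turn up depends on the number of children. The sketch below is a rough illustration, not part of the original analysis: it assumes each child independently receives one of the four parental haplotype pairings with equal probability, and uses inclusion-exclusion to compute the chance that every pairing is seen at least once. The function name `prob_all_patterns_seen` is made up for this example.

```python
from math import comb


def prob_all_patterns_seen(n_children: int, n_patterns: int = 4) -> float:
    """Probability that n_children independent draws cover all n_patterns.

    Inclusion-exclusion over the number of patterns that are never drawn.
    Assumes every pattern is equally likely for every child.
    """
    return sum(
        (-1) ** k
        * comb(n_patterns, k)
        * ((n_patterns - k) / n_patterns) ** n_children
        for k in range(n_patterns + 1)
    )


# Compare a trio-sized family, a four-child family and the 11 children here.
for n in (1, 4, 11):
    print(n, round(prob_all_patterns_seen(n), 4))
```

Under these assumptions the value for 11 children comes out near 0.83 for any one region, compared with under 0.1 for four children, which is one way to see why a large sibship helps.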
5. 5
Using haplotypes to detect conflicts
[Figure: the same haplotype diagram for the parents and children, with the label "Error at this sample/position" marking a discordant call; one copy of haplotype ACAGTA reads ACATTA.]
Individual GT accuracy is assessed using surrounding genotype calls across the pedigree
Genotypes are parsimoniously phased to minimize the number of conflicts across the pedigree
Facilitates assigning conflicts to sample, imputation of missing data and error correction
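The parsimony idea can be sketched for a single biallelic site. This is a minimal illustration under stated assumptions, not the project's actual code: the inheritance vector `CHILD_HAPLOTYPES` is hypothetical, genotypes are coded as alternate-allele counts, and the search simply tries all 16 ways of placing the two alleles on the four parental haplotypes and keeps the one with the fewest conflicts.

```python
from itertools import product

# Hypothetical inheritance for one region: each child carries one paternal
# haplotype (0 or 1) and one maternal haplotype (2 or 3).
CHILD_HAPLOTYPES = [(0, 2), (0, 3), (1, 2), (1, 3), (0, 2), (1, 3)]


def predicted_genotypes(alleles, child_haplotypes):
    """Alt-allele counts for father, mother, then each child."""
    father = alleles[0] + alleles[1]
    mother = alleles[2] + alleles[3]
    children = [alleles[p] + alleles[m] for p, m in child_haplotypes]
    return [father, mother] + children


def most_parsimonious(observed, child_haplotypes):
    """Return (alleles, conflicting sample indices) with the fewest conflicts.

    observed holds alt-allele counts (0, 1, 2) or None for a missing call.
    """
    best = None
    for alleles in product((0, 1), repeat=4):
        predicted = predicted_genotypes(alleles, child_haplotypes)
        conflicts = [
            i
            for i, (obs, exp) in enumerate(zip(observed, predicted))
            if obs is not None and obs != exp
        ]
        if best is None or len(conflicts) < len(best[1]):
            best = (alleles, conflicts)
    return best


# Simulated truth: only paternal haplotype 1 carries the alternate allele.
truth = predicted_genotypes((0, 1, 0, 0), CHILD_HAPLOTYPES)
observed = list(truth)
observed[4] = 0  # inject one error: the sample at index 4 should be 1

alleles, conflicts = most_parsimonious(observed, CHILD_HAPLOTYPES)
print(alleles, conflicts)
```

The returned conflict list points at the sample carrying the injected error, and the predicted genotype for a sample whose call is `None` could serve as an imputed value.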
6. 6
First step is to define the inheritance of the parental chromosomes to the eleven
children everywhere in the genome
– Identified 709 crossover events between the parents and eleven children
Variants called across the pedigree using multiple callers
– E.g. GATK, Cortex, Isaac & CGI for SNPs
Define accurate variants as those where the genotypes are 100% consistent
with the transmission of the parental haplotypes
– At any position of the genome there are only 16 possible combinations of genotypes
(biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
– 3^13 (~1.6M) possible genotype combinations
Analysis of variant calls within the pedigree structure
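The two counts on this slide can be reproduced with a short enumeration. The sketch assumes a biallelic, diploid site, 13 samples (two parents and eleven children) and a hypothetical inheritance vector `CHILD_HAPLOTYPES` in which every parental haplotype pairing occurs; none of these names come from the original pipeline.

```python
from itertools import product

N_SAMPLES = 13         # two parents plus eleven children
GENOTYPES = (0, 1, 2)  # alt-allele count at a biallelic, diploid site

# Every way of assigning one of three genotypes to each sample.
total_combinations = len(GENOTYPES) ** N_SAMPLES

# Hypothetical inheritance for one region: (paternal, maternal) haplotype
# carried by each of the eleven children.
CHILD_HAPLOTYPES = [
    (0, 2), (0, 3), (1, 2), (1, 3), (0, 2), (0, 3),
    (1, 2), (1, 3), (0, 2), (1, 3), (0, 3),
]


def consistent_patterns(child_haplotypes):
    """Genotype vectors that agree with the given haplotype transmission."""
    patterns = set()
    for alleles in product((0, 1), repeat=4):
        father = alleles[0] + alleles[1]
        mother = alleles[2] + alleles[3]
        children = tuple(alleles[p] + alleles[m] for p, m in child_haplotypes)
        patterns.add((father, mother) + children)
    return patterns


patterns = consistent_patterns(CHILD_HAPLOTYPES)
print(total_combinations, len(patterns))
```

Each consistent pattern corresponds to one placement of the reference and alternate alleles on the four parental haplotypes, which is where the figure of 16 comes from.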
7. 7
Homozygous positions (GATK)
– ~2.6B positions identified as homozygous reference across the pedigree
SNPs (GATK, Cortex, Isaac & CGI)
– ~4.7M positions where SNPs agree with transmission of parental chromosomes
– >95% (4.5M) called consistent with transmission by multiple algorithms/technologies
– >98% (4.6M) with supporting evidence from other call sets (i.e. same variant called in
at least one of the samples)
Indels (GATK, Cortex & CGI)
– ~640k indels consistent with transmission of parental chromosomes
– Events range in size from 1 to 350bp
CNVs (BreakDancer & Grouper)
– ~772 CNVs - mostly deletions though a couple of duplications
– Events range from 1kb to 322kb though still refining break points
Current state
9. 9
Incorporating larger variants
SNPs and small indels work well because the genotypes are highly accurate
– A single genotyping error in any of the 13 samples will almost never be consistent
with the haplotype transmission
Developing approaches for other variant types that have lower calling accuracy
– Many CNV callers do not provide GT information
– Accuracy is too low to use pedigree-consistency
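The claim about single genotyping errors can be checked in a toy model. The sketch below assumes a biallelic site and a hypothetical inheritance vector in which every parental haplotype is carried by at least one child; it takes each transmission-consistent genotype vector, changes one genotype at a time, and counts how often the result is still consistent.

```python
from itertools import product

# Hypothetical (paternal, maternal) haplotypes for the eleven children.
CHILD_HAPLOTYPES = [
    (0, 2), (0, 3), (1, 2), (1, 3), (0, 2), (0, 3),
    (1, 2), (1, 3), (0, 2), (1, 3), (0, 3),
]


def genotype_vector(alleles):
    """Alt-allele counts for father, mother and the eleven children."""
    father = alleles[0] + alleles[1]
    mother = alleles[2] + alleles[3]
    kids = tuple(alleles[p] + alleles[m] for p, m in CHILD_HAPLOTYPES)
    return (father, mother) + kids


# All genotype vectors that agree with the transmission of the haplotypes.
consistent = {genotype_vector(a) for a in product((0, 1), repeat=4)}


def single_errors_still_consistent():
    """Count single-genotype changes that leave a vector consistent."""
    hits = 0
    tested = 0
    for vector in consistent:
        for i, true_gt in enumerate(vector):
            for wrong_gt in (0, 1, 2):
                if wrong_gt == true_gt:
                    continue
                tested += 1
                mutated = vector[:i] + (wrong_gt,) + vector[i + 1:]
                if mutated in consistent:
                    hits += 1
    return hits, tested


hits, tested = single_errors_still_consistent()
print(hits, "of", tested, "single errors remain consistent")
```

In this idealised setting no single error goes undetected. Real data includes regions where a parental haplotype is not transmitted to any child, so "almost never" is the safer statement.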
10. 10
Incorporating CNVs into this framework
Make breakpoint calls within each sample using BreakDancer & Grouper
Identify regions of overlap between samples (keeping singletons)
Corroborate based on read counts within the putative CNV events
Refine to breakpoint resolution
[Figure: candidate calls in samples NA12877 to NA12882 merged into test regions]
• Count the uniquely aligned reads within the defined break points of the test regions for each sample & identify events where the read counts are consistent with a deletion or duplication
• For internally consistent events, follow up with targeted analysis to identify bp resolution of events
• On average ~150x depth for every event
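The read-count step can be illustrated with a simple ratio. This is a sketch under assumptions rather than the method used in the project: `expected_diploid_reads` stands for whatever count a two-copy region of the same length would yield in that sample (for example estimated from flanking sequence), and both the function name and the numbers in the usage lines are invented for illustration.

```python
def copy_number_from_reads(event_reads, expected_diploid_reads):
    """Nearest whole copy number implied by a read count.

    event_reads: uniquely aligned reads inside the putative CNV.
    expected_diploid_reads: reads expected if two copies were present.
    """
    if expected_diploid_reads <= 0:
        raise ValueError("expected_diploid_reads must be positive")
    ratio = event_reads / expected_diploid_reads
    # A ratio near 1.0 means two copies, 0.5 one copy, 0.0 no copies.
    return round(2 * ratio)


# Illustrative values only: a diploid, a haploid, a homozygous-deleted
# and a duplicated sample, each against an expectation of 1500 reads.
for reads in (1480, 760, 12, 2300):
    print(reads, copy_number_from_reads(reads, 1500))
```

A sample would then count as supporting a deletion when the result is 0 or 1 and a duplication when it is 3 or more.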
11. 11
Using read counts to confirm deletions – 8.5kb deletion
[Figure: read counts (axis 0 to 2000) per sample, with copy-number levels 0 (zero-ploid), 1 (haploid) and 2 (diploid). Haplotype pairs of the 13 samples: AB CD CB DA CB DB DA CB CA DB CB CA DA]
Best solution: A=0; B=1; C=1; D=1
All samples with haplotype A are consistent with haploid based on read counts
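The best solution on this slide can be recovered by brute force. The sketch below uses the haplotype pairs listed for the 13 samples and assumes, as the slide's solution implies, that samples carrying haplotype A look haploid and the rest look diploid. It then tries every assignment of 0 or 1 copies to the four parental haplotypes and keeps the one with the fewest mismatches. Restricting each haplotype to 0 or 1 copies is an assumption that only suits deletions.

```python
from itertools import product

# Haplotype pairs for the 13 samples, in the order listed on the slide.
SAMPLE_HAPLOTYPES = "AB CD CB DA CB DB DA CB CA DB CB CA DA".split()


def fit_haplotype_copies(observed_copies, sample_haplotypes=SAMPLE_HAPLOTYPES):
    """Return (copies per haplotype, mismatch count) for the best fit."""
    best = None
    for copies in product((0, 1), repeat=4):
        per_hap = dict(zip("ABCD", copies))
        mismatches = sum(
            per_hap[pair[0]] + per_hap[pair[1]] != obs
            for pair, obs in zip(sample_haplotypes, observed_copies)
        )
        if best is None or mismatches < best[1]:
            best = (per_hap, mismatches)
    return best


# Copy numbers implied by the slide: carriers of A haploid, others diploid.
observed = [1 if "A" in pair else 2 for pair in SAMPLE_HAPLOTYPES]

solution, mismatches = fit_haplotype_copies(observed)
print(solution, mismatches)
```

With real read counts the observed copy numbers would be noisy, and the mismatch count gives a simple measure of how well the pedigree supports the event.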
12. 12
Breakdown of 772 “accurate” CNVs (1kb to 322kb in size)
[Venn diagram of BreakDancer and Grouper calls; the three region counts are 266, 408 and 98]
13. 13
Assembling breakpoints for the 772 CNVs
– Reassessing the “failed” calls where applicable
Incorporating different calling algorithms / methods
– E.g. SNP inheritance can help identify CNVs that are missed by other methods
– Including mate pair data (~2kb insert size)
Working on different methods to improve our catalogue of ~30bp to 2kb events &
incorporating different callers
Assigning error modes for “failed” SNPs
– Many look like cell line mutations & alignment errors
Comparing our call set to other datasets to assess accuracy and completeness
– Other GIAB call sets
– Fosmid data (Jaffe & Kidd)
Next steps
14. 14
Illumina: Morten Kallberg, Xiaoyu Chen, Han-Yu Chuang, Phil Tedder, Sean Humphray, Elliott Margulies, David Bentley
Oxford: Zamin Iqbal, Gil McVean
This data and more available at www.platinumgenomes.org
Acknowledgements
Editor's notes
Thank you Tanya and thanks to everyone for attending this seminar.
This project grew out of an observation that there is no comprehensive truth set of variant calls, and this gap is becoming increasingly problematic as sequencing moves to the clinic. Additionally, the validation that has been done using trio conflicts or orthogonal technologies usually assesses only a relatively small percentage of the variants. We are working to solve this by sequencing a large pedigree and using the parental inheritance to assess the accuracy of variant calls. The goal is to deliver a set of highly accurate variant calls, make the data publicly available as a community resource, and demonstrate a framework for validating variant calls and improving variant callers, especially for more complicated variants such as indels and structural variants.
To demonstrate the utility of analyzing a full pedigree we have sequenced all 17 members of a well-characterized CEPH pedigree to 50x depth. In addition we have sequenced the trio highlighted in bold to 200x each and performed a technical replicate of the child of this trio (NA12882), again to 200x, so that we have a total of 400x sequence depth on this child. For the work I'm presenting today we will concentrate on SNP analysis in the parents and 11 children of the last two generations, but we are already looking at indels and larger variants.
We gain power for error detection by being able to calculate the inheritance of the parental haplotypes. With a large number of children we will observe all 4 possible pairings of the parental haplotypes, and when that occurs we have much increased power to identify genotype errors. Because there are 11 siblings we have additional power, since there are internal replicates built in for some inherited parental haplotype pairings. In this figure, I've highlighted the inheritance pattern for six of the children in a small region of chromosome 22 where a single inheritance pattern occurs, i.e. a region bounded by detected crossover events. Within this region we can convert genotypes to haplotypes as I've illustrated above.
If we just look at the haplotypes in blue, we can immediately detect conflicts. For example, one child is the "odd man out", showing a T rather than a G at the fourth site, which indicates an error in this genotype. This also illustrates the power of the method: each genotype call is supported or not supported based on the surrounding genotype calls across the pedigree. In practice, when we calculate conflict rates we choose a parsimonious solution that agrees most closely with the observed genotypes, and thus we will underestimate the true error rate, though this effect is likely small. This method allows us to assign an error to a sample, impute missing calls and, in some cases, correct errors.