Aug2013 illumina platinum genomes

Platinum Genomes: Identifying variants using a large pedigree
Michael A. Eberle
GIAB, August 2013

2
Platinum Genome project: Improving technology & tools
Create a catalogue of highly accurate whole-genome variant calls within a well-characterized pedigree
– SNPs, indels & CNVs
– Including highly confident reference positions
– Provide direct supporting evidence for every variant call
Develop a framework to assess variant callers
Provide a path to improve variant callers by supplying better truth data for assessing sensitivity and precision
– Modifying the SNP filters to maximize accuracy
[Figure: truth set vs. test calls, with test calls classified as correct, FP or FN]

3
NIST GIAB – Pedigree analysis
Grandparents: 12889 12890 12891 12892
Parents: 12877 12878
Children: 12879 12880 12881 12882 12883 12884 12885 12886 12887 12888 12893
All 17 members sequenced to at least 50x depth (PCR-Free protocol)
Variants are called across the pedigree using different software & technology
Inheritance information provides high-confidence, direct validation of variant calls
Analysis of SNPs in the parents and 11 children

4
Pedigree Analysis – Using haplotypes to detect conflicts
[Figure: haplotype diagram showing the alleles carried by the parents and the children]
With a sufficiently large pedigree all four possible inheritance patterns will be observed and most of the genotypes can be phased into haplotypes
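As a rough illustration of what "sufficiently large" means here, the sketch below estimates how likely it is that all four inheritance patterns (one of two haplotypes from each parent, 2 x 2 = 4) show up at a single locus. It assumes each child receives each pattern independently with probability 1/4; the function name `prob_all_patterns` is made up for this example and is not part of the Platinum Genomes pipeline.

```python
from math import comb

def prob_all_patterns(n_children: int, n_patterns: int = 4) -> float:
    """P(every inheritance pattern is seen at least once among the children).

    Assumption: each child independently receives one of the n_patterns
    combinations with equal probability. Uses inclusion-exclusion over the
    number k of patterns that are missing.
    """
    return sum(
        (-1) ** k
        * comb(n_patterns, k)
        * ((n_patterns - k) / n_patterns) ** n_children
        for k in range(n_patterns + 1)
    )

# Compare a small family with the 11 children of this pedigree.
for n in (4, 11, 20):
    print(n, round(prob_all_patterns(n), 4))
```

Under these assumptions the probability for 11 children at any one locus is about 0.83, so most, but not all, of the genome would show all four patterns.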

5
Using haplotypes to detect conflicts
[Figure: haplotype diagram for the parents and children, with one genotype marked "Error at this sample/position"]
Individual GT accuracy is assessed using surrounding genotype calls across the pedigree
Genotypes are parsimoniously phased to minimize the number of conflicts across the pedigree
Facilitates assigning conflicts to a sample, imputation of missing data and error correction
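The idea of parsimonious phasing can be sketched for a single biallelic position: try every assignment of reference/alternate alleles to the four parental haplotypes and keep the one that disagrees with the fewest observed genotypes. This is a simplified illustration, not the method used in the talk (which also draws on surrounding positions); the pedigree, sample names and genotypes below are all hypothetical.

```python
from itertools import product

# Hypothetical inheritance at one locus: the father carries haplotypes A/B,
# the mother carries C/D, and each child receives one haplotype from each.
PEDIGREE = {
    "father": "AB", "mother": "CD",
    "child1": "AC", "child2": "AD", "child3": "BC", "child4": "BD",
    "child5": "AC",
}

def expected_genotypes(alleles):
    """Alt-allele count per sample, given a 0/1 allele on each haplotype."""
    return {s: alleles[h[0]] + alleles[h[1]] for s, h in PEDIGREE.items()}

def most_parsimonious(observed):
    """Return (conflicting samples, haplotype alleles) with the fewest conflicts."""
    best = None
    for bits in product((0, 1), repeat=4):
        alleles = dict(zip("ABCD", bits))
        expected = expected_genotypes(alleles)
        conflicts = [s for s in PEDIGREE if observed[s] != expected[s]]
        if best is None or len(conflicts) < len(best[0]):
            best = (conflicts, alleles)
    return best

# Made-up observed genotypes (alt-allele counts); child3 carries a simulated error.
observed = {"father": 1, "mother": 0, "child1": 1, "child2": 1,
            "child3": 1, "child4": 0, "child5": 1}
conflicts, alleles = most_parsimonious(observed)
print("conflicting samples:", conflicts)
print("inferred haplotype alleles:", alleles)
```

The conflicting sample is the candidate error, and the inferred haplotype alleles give the genotype that sample should have had, which is also how a missing genotype could be imputed.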

6
Analysis of variant calls within the pedigree structure
First step is to define the inheritance of the parental chromosomes to the eleven children everywhere in the genome
– Identified 709 crossover events between the parents and eleven children
Variants called across the pedigree using multiple callers
– E.g. GATK, Cortex, Isaac & CGI for SNPs
Define accurate variants as those where the genotypes are 100% consistent with the transmission of the parental haplotypes
– At any position of the genome there are only 16 possible combinations of genotypes (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
– Out of 3^13 (~1.6M) possible genotype combinations
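The two counts can be checked by brute force. Each of the four parental haplotypes carries either the reference or the alternate allele (2^4 = 16 assignments), while 13 diploid samples with three possible biallelic genotypes each give 3^13 combinations. The inheritance vector below is hypothetical; the real one changes at each of the crossover events.

```python
from itertools import product

# Hypothetical inheritance vector: haplotype pair carried by each of the 11 children.
CHILDREN = ["AC", "AD", "BC", "BD", "AC", "BD", "AD", "BC", "AC", "BD", "AD"]
SAMPLES = ["AB", "CD"] + CHILDREN  # 2 parents + 11 children = 13 samples

def consistent_genotype_vectors(samples):
    """All genotype vectors (alt-allele counts) that the haplotypes can produce."""
    vectors = set()
    for bits in product((0, 1), repeat=4):
        allele = dict(zip("ABCD", bits))
        vectors.add(tuple(allele[h[0]] + allele[h[1]] for h in samples))
    return vectors

consistent = consistent_genotype_vectors(SAMPLES)
total = 3 ** len(SAMPLES)  # 0/0, 0/1 or 1/1 for each of the 13 samples
print(len(consistent), "consistent out of", total)
```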

7
Current state
Homozygous positions (GATK)
– ~2.6B positions identified as homozygous reference across the pedigree
SNPs (GATK, Cortex, Isaac & CGI)
– ~4.7M positions where SNPs agree with transmission of parental chromosomes
– >95% (4.5M) called consistent with transmission by multiple algorithms/technologies
– >98% (4.6M) with supporting evidence from other call sets (i.e. same variant called in at least one of the samples)
Indels (GATK, Cortex & CGI)
– ~640k indels consistent with transmission of parental chromosomes
– Events range in size from 1 to 350bp
CNVs (BreakDancer & Grouper)
– ~772 CNVs, mostly deletions with a couple of duplications
– Events range from 1kb to 322kb, though break points are still being refined

9
Incorporating larger variants
SNPs and small indels work well because the genotypes are highly accurate
– A single genotyping error in any of the 13 samples will almost never be consistent
with the haplotype transmission
Developing approaches for other variant types that have lower calling accuracy
– Many CNV callers do not provide GT information
– Accuracy is too low to use pedigree-consistency
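Why a single genotyping error is so unlikely to pass the pedigree check can be illustrated with a small enumeration: take every pedigree-consistent genotype vector, corrupt one genotype at a time, and count how often the result is still consistent. This is a single-site toy model with a hypothetical inheritance vector, so it only shows the principle, not real error rates.

```python
from itertools import product

# Hypothetical haplotype pairs: father, mother, then 11 children.
SAMPLES = ["AB", "CD", "AC", "AD", "BC", "BD", "AC",
           "BD", "AD", "BC", "AC", "BD", "AD"]

def consistent_vectors(samples):
    """Genotype vectors obtainable from 0/1 alleles on haplotypes A-D."""
    out = set()
    for bits in product((0, 1), repeat=4):
        allele = dict(zip("ABCD", bits))
        out.add(tuple(allele[h[0]] + allele[h[1]] for h in samples))
    return out

def single_error_escapes(samples):
    """Count single-genotype errors that still look pedigree-consistent."""
    valid = consistent_vectors(samples)
    escapes = trials = 0
    for vec in valid:
        for i, true_gt in enumerate(vec):
            for wrong_gt in {0, 1, 2} - {true_gt}:
                trials += 1
                mutated = vec[:i] + (wrong_gt,) + vec[i + 1:]
                escapes += mutated in valid  # True counts as 1
    return escapes, trials

escapes, trials = single_error_escapes(SAMPLES)
print(f"{escapes} of {trials} single errors remain consistent")
```

In this toy model none of the 416 single-genotype errors survives, because changing the allele on any haplotype alters the genotype of at least two samples. Lower per-call accuracy, as with CNVs, means many sites carry several errors at once, which is why the same test is harder to apply there.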

10
Incorporating CNVs into this framework
Make breakpoint calls within each sample using BreakDancer & Grouper
Identify regions of overlap between samples (keeping singletons)
Corroborate based on read counts within the putative CNV events
Refine to breakpoint resolution
[Figure: breakpoint calls in NA12877, NA12878, NA12879, NA12880, NA12881 and NA12882 combined into test regions]
• Count the uniquely aligned reads within the defined break points for the test regions for each sample & identify events where the read counts are consistent with a deletion or duplication
• For internally-consistent events, follow up with targeted analysis to identify bp resolution of events
• On average ~150x depth for every event
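The "identify regions of overlap" step could look something like the sketch below, which merges overlapping per-sample calls on one chromosome into test regions and keeps calls seen in only one sample. The `Call` type, the `build_test_regions` function and all coordinates are invented for illustration; the actual BreakDancer and Grouper outputs and the merging rules used in the project are not described in the slides.

```python
from typing import NamedTuple

class Call(NamedTuple):
    sample: str
    start: int
    end: int

def build_test_regions(calls):
    """Merge overlapping per-sample calls into test regions, keeping singletons.

    Assumes all calls lie on the same chromosome; calls that merely touch
    (start == previous end) are treated as overlapping.
    """
    regions = []
    for call in sorted(calls, key=lambda c: c.start):
        if regions and call.start <= regions[-1]["end"]:
            region = regions[-1]
            region["end"] = max(region["end"], call.end)
            region["samples"].add(call.sample)
        else:
            regions.append(
                {"start": call.start, "end": call.end, "samples": {call.sample}}
            )
    return regions

# Made-up breakpoint calls (coordinates are not real data).
calls = [
    Call("NA12877", 1_000, 9_500),
    Call("NA12879", 1_200, 9_400),
    Call("NA12881", 1_100, 9_600),
    Call("NA12878", 50_000, 53_000),  # singleton: kept as its own region
]
regions = build_test_regions(calls)
for region in regions:
    print(region["start"], region["end"], sorted(region["samples"]))
```

Each resulting region would then be tested in every sample by counting reads inside it, regardless of which samples produced the original call.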

11
Using read counts to confirm deletions – 8.5kb deletion
[Figure: read counts (axis ReadCounts, 0 to 2000) for each of the 13 samples, with a second scale from 0 to 2 marking the Zero-ploid, Haploid and Diploid levels; samples labeled by haplotype pair: AB CD CB DA CB DB DA CB CA DB CB CA DA]
Best solution: A=0 ; B=1 ; C=1 ; D=1
All samples with haplotype A are consistent with haploid based on read counts
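A "best solution" of this kind can be found by trying all 16 ways of assigning 0 or 1 copies of the region to haplotypes A to D and picking the assignment whose predicted copy numbers fit the read counts best. The haplotype pairs below are the labels from the slide; the read counts and the expected reads per copy are made-up values, and the least-squares criterion is an assumption, since the slides do not say how the fit was scored.

```python
from itertools import product

# Haplotype pair carried by each of the 13 samples, as labeled on the slide.
HAPLOTYPES = "AB CD CB DA CB DB DA CB CA DB CB CA DA".split()

# Hypothetical read counts inside the putative deletion (not the slide's data).
READ_COUNTS = [760, 1510, 1480, 740, 1530, 1490, 770,
               1460, 730, 1520, 1500, 750, 745]
READS_PER_COPY = 750  # assumed expected reads for one copy of the region

def best_copy_assignment(haplotypes, counts, reads_per_copy):
    """Pick the 0/1 copy count per haplotype that minimizes squared error.

    Only deletions are modeled (0 or 1 copy per haplotype); a duplication
    would need 2 copies on a haplotype to be allowed as well.
    """
    best = None
    for bits in product((0, 1), repeat=4):
        copies = dict(zip("ABCD", bits))
        error = sum(
            (count - reads_per_copy * (copies[h[0]] + copies[h[1]])) ** 2
            for h, count in zip(haplotypes, counts)
        )
        if best is None or error < best[0]:
            best = (error, copies)
    return best[1]

solution = best_copy_assignment(HAPLOTYPES, READ_COUNTS, READS_PER_COPY)
print(solution)
```

With these illustrative counts the fit recovers A=0, B=1, C=1, D=1, meaning the deletion sits on haplotype A and the six samples carrying A are haploid for the region.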

12
Breakdown of 772 “accurate” CNVs (1kb to 322kb in size)
[Figure: Venn diagram of BreakDancer and Grouper calls; the three regions contain 266, 408 and 98 CNVs]

13
Next steps
Assembling breakpoints for the 772 CNVs
– Reassessing the “failed” calls where applicable
Incorporating different calling algorithms / methods
– E.g. SNP inheritance can help identify CNVs that are missed by other methods
– Including mate pair data (~2kb insert size)
Working on different methods to improve our catalogue of ~30bp to 2kb events & incorporating different callers
Assigning error modes for “failed” SNPs
– Many look like cell line mutations & alignment errors
Comparing our call set to other datasets to assess accuracy and completeness
– Other GIAB call sets
– Fosmid data (Jaffe & Kidd)

14
Acknowledgements
Illumina: Morten Kallberg, Xiaoyu Chen, Han-Yu Chuang, Phil Tedder, Sean Humphray, Elliott Margulies, David Bentley
Oxford: Zamin Iqbal, Gil McVean
This data and more available at www.platinumgenomes.org
