Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

Striving for Perfection: The Platinum Genomes Project

Elliott H. Margulies, Ph.D.
Director, Scientific Research
© 2011 Illumina, Inc. All rights reserved.

From Sample to Answer
Sample → Sequence → Analyse → Annotate → Interpret → Answer

Enabling clinical use of WGS

Fast sequencing from low-input and FFPE samples

Improved Accuracy and Utility of detected variants

Integrated “push button” analyses – from sequence to annotated variants

Focus on genome exploration


The truth is hard to find…

Sequencing the same genome twice does not give you the identical answer
[Figure: variants called the first time vs. the second time]

We identify many more Mendelian conflicts than actually exist
[Figure: trio example with Dad A/A, Mom T/T, Child T/T]
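
A minimal sketch of what a Mendelian conflict check can look like, assuming autosomal diploid genotypes written as "A/T" and ignoring de novo mutations. The function names are illustrative and not from the talk; the Dad A/A, Mom T/T, Child T/T example is one reading of the trio figure on this slide.

```python
from itertools import product


def parse_genotype(text):
    """Split a genotype such as "A/T" into a tuple of two alleles."""
    return tuple(text.split("/"))


def is_mendelian_consistent(father, mother, child):
    """Return True if the child could have inherited one allele from each parent.

    Assumes autosomal, diploid genotypes and no de novo mutation.
    """
    # Every unordered allele pair obtainable from one paternal and one maternal allele.
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


dad = parse_genotype("A/A")
mom = parse_genotype("T/T")

# The child can only be A/T, so a T/T call is flagged as a conflict.
print(is_mendelian_consistent(dad, mom, parse_genotype("T/T")))  # False
print(is_mendelian_consistent(dad, mom, parse_genotype("A/T")))  # True
```

A flagged conflict says only that at least one of the three calls is inconsistent; it does not say which one is wrong. That is why conflict counts work as an upper-level error signal, not a per-sample one.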


Summary of increased accuracy
Eland+CASAVA

Sensitivity  Mendelian Conflicts  Accuracy   Filter
96.62        13,032               99.9995%   unfiltered
96.10        8,383                99.9997%   + gVCF filters
95.25        5,309                99.9998%   + score:coverage

1.43% loss in sensitivity; 59.26% loss in conflicts

Sensitivity  Conflicts            Accuracy   Method
95.90        4,928                99.9998%   BWA+MPG*

NB: Accuracy is expressed here as % total filtered calls that are Mendelian concordant
* Accurate and comprehensive sequencing of personal genomes
S.S. Ajay, S.C.J. Parker, H. Ozel Abaan, Karin V. Fuentes Fajardo, and E.H. Margulies
Genome Res. 2011 21: 1498-1505
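
The two "loss" figures on this slide can be reproduced as relative reductions between the unfiltered row and the "+ score:coverage" row. The sketch below assumes that definition (the slide does not spell it out) and uses only the rounded values shown in the table. The conflict figure comes out as the slide's 59.26%. The sensitivity figure lands close to, but not exactly on, 1.43%, presumably because the slide was computed from unrounded sensitivities.

```python
def relative_loss(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Values copied from the Eland+CASAVA table (unfiltered vs. + score:coverage).
conflicts_unfiltered = 13_032
conflicts_filtered = 5_309
sensitivity_unfiltered = 96.62
sensitivity_filtered = 95.25

conflict_loss = relative_loss(conflicts_unfiltered, conflicts_filtered)
sensitivity_loss = relative_loss(sensitivity_unfiltered, sensitivity_filtered)

print(f"Mendelian conflicts removed: {conflict_loss:.2f}%")
print(f"Sensitivity given up:        {sensitivity_loss:.2f}%")
```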


A critical assessment of whole-genome sequencing…
• Where are we doing well?
• What parts of the genome are still inaccessible or less accurately called – and most importantly, why?

GOALS:
• Maximum utility for use in research and medical applications
• Determine key areas for improvement and assess progress
• Assess performance in real-life situations

Platinum genomes: the proposal
• Select a small set of well-known and accessible genomes
• Generate initial WGS datasets using best current practices
• Make them freely available in a database following "open source" principles
• Perform analyses to define high- and low-quality regions and variant calls
• Examine low-quality regions and calls and validate with additional evidence (methods)
• Maintain a database with revised data and evidence to provide a long-term benchmark
• Develop improved methods (analysis, chemistry, sample prep)

CEPH/Utah Pedigree 1463
Grandparents: 12889 12890 12891 12892
Parents:      12877 12878
Children:     12879 12880 12881 12882 12883 12884 12885 12886 12887 12888 12893

• Three-generation family, extensively sequenced by the genomics community
• Focus on the trio shaded in gray (12877, 12878 and 12882)
• Sourced ~200 µg for the initial trio (shaded) and ~50 µg for all others

Initial dataset

Sample    Depth (x)  Q30 (%)  Genotype coverage (%)  Genotype concordance (%)
NA12877   219.63     91.3     99.79                  99.25
NA12878   211.88     93.6     99.8                   99.25
NA12882   217.95     93.2     99.8                   99.24                     (Technical Replicate)
NA12881   46.67      91.7     99.84                  99.28
NA12880   48.37      91.4     99.74                  99.28
NA12879   48.01      92       99.75                  99.29
NA12883   54.73      94.2     99.6                   99.27
NA12884   43.76      93.2     99.7                   99.27
NA12885   54.56      94       99.8                   99.28
NA12886   64.98      91       99.8                   99.28
NA12887   48.33      92.4     99.81                  99.29
NA12888   47.61      92.2     99.81                  99.28
NA12889   49.99      91       99.49                  99.28
NA12890   59.34      88       99.8                   99.29
NA12891   45.49      93       99.75                  99.28
NA12892   50.32      93.4     99.67                  99.29
NA12893   47.69      92.7     99.79                  99.28

NA12882

        Technical Replicate A     Technical Replicate B
200x    1 build (18 lanes)        1 build (18 lanes)
100x    2 builds (8 lanes each)   2 builds (8 lanes each)
50x     4 builds                  4 builds

• Callability and reproducibility among pairs of replicates
  – 50x vs 100x vs 200x
  – Between technical replicates

Pair-wise comparisons of genome builds

Concordance at variant positions where both genomes PASSed basic quality filters

Coverage         Library          SNPs             Indels   Combined
50x              different        99.34%           90.94%   98.52%
50x              same             99.36%           90.83%   98.52%
[not recovered]  [not recovered]  [not recovered]  90.60%   98.57%
100x             same             99.47%           90.54%   98.56%
[not recovered]  [not recovered]  [not recovered]  90.23%   98.55%
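
A minimal sketch of the kind of pair-wise comparison summarised above, assuming each build is a mapping from position to a (genotype, passed_filter) pair. The data layout, names and toy values are hypothetical. The exact set of positions counted in the talk (for example, whether a site must be variant in one or both builds) is not stated, so this version simply uses every position that is present and PASSing in both builds.

```python
def concordance(build_a, build_b):
    """Percentage of positions PASSing in both builds that carry the same genotype.

    Each build maps a position to (genotype, passed_filter).
    """
    shared = [
        pos
        for pos in build_a.keys() & build_b.keys()
        if build_a[pos][1] and build_b[pos][1]
    ]
    if not shared:
        return float("nan")
    same = sum(build_a[pos][0] == build_b[pos][0] for pos in shared)
    return 100.0 * same / len(shared)


# Toy data, not taken from the presentation.
build_a = {
    ("chr1", 100): ("A/G", True),
    ("chr1", 200): ("T/T", True),
    ("chr1", 300): ("C/T", False),  # fails the filter in build A, so it is skipped
    ("chr2", 50): ("G/G", True),
    ("chr2", 75): ("A/C", True),
}
build_b = {
    ("chr1", 100): ("A/G", True),
    ("chr1", 200): ("C/T", True),   # differs from build A
    ("chr1", 300): ("C/T", True),
    ("chr2", 50): ("G/G", True),
    ("chr2", 75): ("A/C", True),
}

result = concordance(build_a, build_b)
print(f"{result:.2f}%")  # 75.00%
```

Restricting the comparison to positions that PASS in both builds measures agreement among confident calls only. Positions that are callable in one build but not the other are a separate question, which the replicate consistency analysis addresses.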


NA12882

        Technical Replicate A     Technical Replicate B
200x    1 build (18 lanes)        1 build (18 lanes)
100x    2 builds (8 lanes each)   2 builds (8 lanes each)
50x     4 builds                  4 builds

• Consistency across all the replicates
  – How many replicates were able to be called at a given position?
  – How many different genotypes were present at that position?
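
The two questions above reduce to two counts per genomic position. The sketch below is an illustration under assumed inputs, not the pipeline used in the talk: each replicate build contributes a hypothetical (genotype, passed_filter) pair for the position.

```python
def summarise_position(calls):
    """Summarise one genomic position across replicate builds.

    `calls` holds one (genotype, passed_filter) tuple per replicate.
    Returns (replicates passing the filter, distinct genotypes among them).
    """
    passing = [genotype for genotype, passed in calls if passed]
    return len(passing), len(set(passing))


# Hypothetical calls at one position in four replicate builds.
example_calls = [
    ("A/G", True),
    ("A/G", True),
    ("G/G", True),
    ("A/G", False),  # filtered out, so it is not counted
]

n_passing, n_genotypes = summarise_position(example_calls)
print(n_passing, n_genotypes)  # 3 2
```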


Consistency among technical replicates

Values are percentages of the genome.
Rows (Reps): number of replicates PASSing genotype quality filter. Columns: number of different genotypes.

Reps  0     1      2       3        4        5        6        7        8        9        10       11       12       13       14
0     1.96
1           0.23
2           0.21   0.0005
3           0.18   0.0006  3.5E-05
4           0.16   0.0007  4.2E-05  8.7E-06
5           0.15   0.0007  4.5E-05  1.3E-05  3.5E-06
6           0.15   0.0008  4.6E-05  1.6E-05  6.1E-06  1.4E-06
7           0.16   0.0008  4.9E-05  1.8E-05  8.8E-06  3.0E-06  8.2E-07
8           0.16   0.0007  5.5E-05  1.9E-05  9.0E-06  4.3E-06  1.9E-06  4.1E-07
9           0.17   0.0007  5.6E-05  2.0E-05  1.1E-05  5.2E-06  2.5E-06  1.4E-06  3.7E-07
10          0.20   0.0006  6.1E-05  2.1E-05  1.1E-05  7.4E-06  3.8E-06  1.9E-06  7.1E-07  1.9E-07
11          0.24   0.0006  6.9E-05  2.6E-05  1.4E-05  9.4E-06  6.4E-06  3.7E-06  1.5E-06  3.7E-07  7.4E-08
12          0.32   0.0007  8.5E-05  3.2E-05  1.9E-05  1.2E-05  8.6E-06  5.5E-06  2.8E-06  1.3E-06  4.8E-07  7.4E-08
13          0.61   0.0010  1.2E-04  4.3E-05  2.8E-05  1.9E-05  1.5E-05  1.1E-05  7.4E-06  4.6E-06  2.0E-06  6.7E-07  2.2E-07
14          95.07  0.0025  2.3E-04  8.6E-05  5.3E-05  4.0E-05  3.6E-05  3.3E-05  3.0E-05  2.3E-05  1.4E-05  7.6E-06  2.1E-06  6.0E-07

“Metal”   Genome   SNVs from a 50x build (%)   SNVs from a 50x build (count)
Gold      95.1%    94.80%                      3,030,777
Silver    2.95%    4.15%                       132,579
Copper    0.01%    1.05%                       33,679
Lead      1.96%
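
One reading of the "metal" categories that is consistent with the numbers on this slide: Lead is a position called in no replicate, Copper is a position with more than one genotype, Gold is a position called in all 14 replicates with a single genotype, and Silver is a single genotype called in only some replicates. Only the Copper definition is stated explicitly in the talk. The others are inferred from the consistency table, so treat the sketch below as an assumption. It also checks that the single-genotype entries for 1 to 13 replicates add up to roughly the Silver share.

```python
N_REPLICATES = 14  # 2 x 200x + 4 x 100x + 8 x 50x builds of NA12882


def metal_class(n_passing, n_genotypes, n_replicates=N_REPLICATES):
    """Assign a "metal" label from the two per-position counts (inferred rules)."""
    if n_passing == 0:
        return "Lead"      # never callable
    if n_genotypes > 1:
        return "Copper"    # replicates disagree
    if n_passing == n_replicates:
        return "Gold"      # always callable, always the same genotype
    return "Silver"        # consistent genotype, but not always callable


# Single-genotype column of the consistency table for 1..13 passing
# replicates, in percent of the genome.
single_genotype_rows_1_to_13 = [
    0.23, 0.21, 0.18, 0.16, 0.15, 0.15, 0.16,
    0.16, 0.17, 0.20, 0.24, 0.32, 0.61,
]
silver_total = sum(single_genotype_rows_1_to_13)

print(metal_class(14, 1), metal_class(9, 1), metal_class(14, 2), metal_class(0, 0))
print(f"Silver share from the table: {silver_total:.2f}%")  # within rounding of the 2.95% on the slide
```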


Genomic features overlapping with “metal” regions

          Genome   SNVs     CDS      medCDS
gold      95.07%   94.80%   96.91%   97.87%
silver    2.95%    4.15%    1.35%    1.11%
copper    0.01%    1.05%    0.003%   0.002%
lead      1.96%    0.00%    1.74%    1.02%

A closer examination of “Copper” regions: those that had more than one genotype

86% of copper regions had just two different genotypes

Type of inconsistency   Percentage
REF / het SNV           37.40
REF / het DEL           21.89
REF / het INS           15.11
het SNV / hom SNV       5.38
het DEL / hom DEL       0.42
het INS / hom INS       1.43
Remaining               18.38

Concordance in “metal” regions
SNP concordance from two builds generated from different libraries

          50x      100x     200x
ALL       99.34%   99.47%   99.53%
Gold      99.80%   99.94%   99.94%
Silver    85.00%   89.81%   93.80%
Copper    53.85%   67.85%   82.12%
Lead*     519      6,589    22,164

Non-gold regions of the genome point to areas that are not comprehensively/accurately assessed

* Absolute values are more revealing

Concordance in “metal” regions
Concordance of variants between two 100x builds from the same library

          SNPs     Indels   Both
Overall   99.47%   90.54%   98.56%
Gold      99.92%   96.77%   99.65%
Silver    90.65%   68.18%   86.32%
Copper    77.13%   57.11%   61.00%
Lead      73.44%   74.73%   73.88%

Indels need more attention

Practical/Clinical/Medical Relevance

200x build comparison in medically-relevant CDS regions

Metal      ALL     Same    Different   Percent the Same   Percent in Metal
Combined   1,187   1,182   5           99.58%
Gold       1,151   1,151   0           100.00%            96.97%
Silver     29      26      3           89.66%             2.44%
Copper     2       2       0           100.00%            0.17%
Lead       5       3       2           60.00%             0.42%
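
The percentage columns in this table follow directly from the counts: "Percent the Same" is Same divided by ALL within a row, and "Percent in Metal" is the row's ALL divided by the combined total. The short sketch below reproduces them from the counts on the slide; the function and variable names are illustrative.

```python
def metal_shares(rows):
    """rows maps metal -> (all_variants, same_in_both_builds).

    Returns, per metal, the percentage of identical calls and the
    percentage of all compared variants that fall in that metal.
    """
    total_all = sum(all_ for all_, _ in rows.values())
    return {
        metal: {
            "percent_same": round(100.0 * same / all_, 2),
            "percent_in_metal": round(100.0 * all_ / total_all, 2),
        }
        for metal, (all_, same) in rows.items()
    }


# Counts from the 200x comparison in medically relevant CDS regions.
cds_rows = {
    "Gold": (1151, 1151),
    "Silver": (29, 26),
    "Copper": (2, 2),
    "Lead": (5, 3),
}

shares = metal_shares(cds_rows)
combined_all = sum(all_ for all_, _ in cds_rows.values())
combined_same = sum(same for _, same in cds_rows.values())
combined_percent_same = round(100.0 * combined_same / combined_all, 2)

for metal, values in shares.items():
    print(metal, values)
print("Combined", combined_all, combined_same, combined_percent_same)
```

With only 36 of the 1,187 compared variants outside Gold regions, the Silver, Copper and Lead percentages rest on very small counts and should be read accordingly.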


Future Plans
• Classify inconsistent parts of the genome into:
  – Alignment or read length issues
    ▪ Paralogous/repetitive/CNV regions
    ▪ Missed or wrong indel calls
  – Depth of coverage
  – Platform-specific artifacts

• Disseminate data/analyses to the research community
• Platform for developing better indel detection
• Error correction via haplotyping efforts
• Independent validation efforts
• Develop a database of variants and associated evidence

Acknowledgements
• David Bentley       • Klaus Maisinger
• Sean Humphray       • Russell Grocock
• Mark Ross           • Peter Saffrey
• Nick Kerry          • Brad Sickler
• Nondas Fritzilas    • Pedro Cruz
• Phil Tedder         • Shankar Ajay
• Mike Eberle         • Marc Laurant
• Lisa Murray         • Semyon Kruglyak

Accurate and comprehensive sequencing of personal genomes
Subramanian S. Ajay,1 Stephen C.J. Parker,1 Hatice Ozel Abaan,1 Karin V. Fuentes Fajardo,2 and Elliott H. Margulies1,3,4
Genome Res. published online July 19, 2011; doi:10.1101/gr.123638.111 (Open Access)

1 Genome Informatics Section, Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; 2 Undiagnosed Diseases Program, Office of the Clinical Director, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA

As whole-genome sequencing becomes commoditized and we begin to sequence and analyze personal genomes for clinical and diagnostic purposes, it is necessary to understand what constitutes a complete sequencing experiment for determining genotypes and detecting single-nucleotide variants. Here, we show that the current recommendation of ~30× coverage is not adequate to produce genotype calls across a large fraction of the genome with acceptably low error rates. Our results are based on analyses of a clinical sample sequenced on two related Illumina platforms, GAIIx and HiSeq 2000, to a very high depth (126×). We used these data to establish genotype-calling filters that dramatically increase accuracy. We also empirically determined how the callable portion of the genome varies as a function of the amount of sequence data used. These results help provide a “sequencing guide” for future whole-genome sequencing decisions and metrics by which coverage statistics should be reported.
[Supplemental material is available for this article.]

Whole-genome sequencing and analysis is becoming part of a translational research toolkit (Lupski et al. 2010; Sobreira et al. 2010) to investigate small-scale changes such as single-nucleotide variants (SNVs) and indels (Bentley et al. 2008; Wang et al. 2008; Kim et al. 2009; McKernan et al. 2009; Fujimoto et al. 2010; Lee et al. 2010; Pleasance et al. 2010) in addition to large-scale events such as chromosomal rearrangements (Campbell et al. 2008; Chen et al. 2008) and copy-number variation (Chiang et al. 2009; Park et al. 2010). For both basic genome biology and clinical diagnostics, the trade-offs of data quality and quantity will determine what constitutes a “comprehensive and accurate” whole-genome analysis, especially for detecting SNVs. As whole-genome sequencing becomes commoditized, it will be important to determine quantitative metrics to assess and describe the comprehensiveness of an individual’s genome sequence. No such standards currently exist.

… a question that is extremely important as whole-genome sequencing and analysis of individual genomes transitions from primarily research-based projects to being used for clinical and diagnostic applications. Additionally, we seek to understand the relationship between the amount of sequence data generated and the resulting proportion of the genome where confident genotypes can be derived; we refer to this as the “callable” portion, a term that is roughly equivalent to the 1000 Genomes Project’s “accessible” portion. Using these sequencing metrics and genotype-calling filters will help obviate the need for costly and time-consuming validation efforts. Currently, no empirically derived data sets exist for determining how much sequence data is needed to enable accurate detection of SNVs.
To address this issue, we sequenced a blood sample from a male individual with an undiagnosed clinical condition on two related platforms (Illumina’s GAIIx and HiSeq 2000) to a total of

Slide overlay (NHGRI): Genotype calls, 50x vs 50x, hg19 callable

Filter                                In both   Discordant
No extra filters                      98.33%    46,580
With alignment and genotype filters   93.13%    1,673
No q20 evidence (MapQ1)                         267
[label not recovered]                           21
