Striving for Perfection: The Platinum Genomes Project

Elliott H. Margulies, Ph.D.
Director, Scientific Research
                                                                                           COMPANY CONFIDENTIAL – DO NOT DISTRIBUTE
© 2011 Illumina, Inc. All rights reserved.
Illumina, illuminaDx, BeadArray, BeadXpress, cBot, CSPro, DASL, Eco, Genetic Energy, GAIIx, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera,
Sentrix, Solexa, TruSeq, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names
contained herein are the property of their respective owners.
From Sample to Answer
Sample → Sequence → Analyse → Annotate → Interpret → Answer

    •   Enabling clinical use of WGS
    •   Fast sequencing from low-input and FFPE samples
    •   Improved Accuracy and Utility of detected variants
    •   Integrated “push button” analyses – from sequence to annotated variants
    •   Focus on genome exploration

2
The truth is hard to find…

    •   Sequencing the same genome twice does not give you the identical answer
        [Figure: Variants, First Time vs. Second Time, with "?" between them]

    •   We identify many more Mendelian conflicts than actually exist
        [Figure: trio with Dad A/A, Mom T/T, Child T/T]
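
The trio on this slide can be checked mechanically. The following is a minimal sketch, assuming autosomal diploid genotypes written as "A/T" strings and ignoring de novo mutation; the function names are illustrative and are not taken from any pipeline mentioned in the talk.

```python
from itertools import product


def parse(genotype):
    """Split a genotype string such as 'A/T' into a tuple of alleles."""
    return tuple(genotype.split("/"))


def is_mendelian_consistent(father, mother, child):
    """True if the child could have inherited one allele from each parent."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


# The example from the slide: Dad A/A, Mom T/T, Child T/T.
conflict = not is_mendelian_consistent(parse("A/A"), parse("T/T"), parse("T/T"))
print(conflict)  # True: the child cannot be T/T if Dad is A/A
```

A conflict like this one may reflect a genotyping error in any of the three samples rather than real biology, which is the point the slide makes.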




3
Summary of increased accuracy
    Eland+CASAVA

    Sensitivity    Mendelian Conflicts    Accuracy     Filter
    96.62          13,032                 99.9995%     unfiltered
    96.10           8,383                 99.9997%     + gVCF filters
    95.25           5,309                 99.9998%     + score:coverage

    1.43% loss in sensitivity; 59.26% loss in conflicts

    Sensitivity    Conflicts              Accuracy     Method
    95.90           4,928                 99.9998%     BWA+MPG*

    NB: Accuracy is expressed here as % total filtered calls that are Mendelian concordant
* Accurate and comprehensive sequencing of personal genomes
   S.S. Ajay, S.C.J. Parker, H. Ozel Abaan, Karin V. Fuentes Fajardo, and E.H. Margulies
   Genome Res. 2011 21: 1498-1505
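
The two "loss" figures are relative reductions between the unfiltered row and the "+ score:coverage" row. A short sketch of that arithmetic, using only the rounded values printed in the table (the helper name is an assumption):

```python
def relative_loss(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Unfiltered vs. "+ score:coverage" rows of the Eland+CASAVA table.
sensitivity_loss = relative_loss(96.62, 95.25)
conflict_loss = relative_loss(13032, 5309)

print(f"{sensitivity_loss:.2f}% loss in sensitivity")
print(f"{conflict_loss:.2f}% loss in conflicts")
```

From the rounded table values the sensitivity loss comes out at about 1.42%, marginally below the 1.43% on the slide, presumably because the slide was computed from unrounded sensitivities. The conflict reduction reproduces the 59.26% figure.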

4
A critical assessment of whole-genome
sequencing…
    •   Where are we doing well?
    •   What parts of the genome are still inaccessible or less
        accurately called – and most importantly, why?


    GOALS:
    •   Maximum utility for use in research and medical applications
    •   Determine key areas for improvement and assess progress
    •   Assess performance in real-life situations



5
Platinum genomes: the proposal
    •   Select a small set of well-known and accessible genomes
    •   Generate initial WGS datasets using best current practices
    •   Make them freely available in a database, following "open source" principles
    •   Perform analyses to define high and low quality regions and
        variant calls
    •   Examine low quality regions and calls and validate with additional
        evidence (methods)
    •   Maintain a database with revised data and evidence to
        provide a long term benchmark
    •   Develop improved methods (analysis, chemistry, sample prep)


6
CEPH/Utah Pedigree 1463
    [Pedigree diagram]
    Generation 1:   12889 and 12890 (parents of 12877); 12891 and 12892 (parents of 12878)
    Generation 2:   12877, 12878
    Generation 3:   12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893

    •   Three-generation family, extensively sequenced by the genomics
        community
    •   Focus on the trio shaded in gray (12877, 12878, and 12882)
    •   Sourced ~200 µg for the initial trio (shaded) and ~50 µg for all
        others
7
Initial dataset
    Sample      Depth (x)   Q30 (%)   Genotype coverage (%)   Genotype concordance (%)
    NA12877     219.63      91.3      99.79                   99.25
    NA12878     211.88      93.6      99.8                    99.25
    NA12882     217.95      93.2      99.8                    99.24      (Technical Replicate)
    NA12881      46.67      91.7      99.84                   99.28
    NA12880      48.37      91.4      99.74                   99.28
    NA12879      48.01      92        99.75                   99.29
    NA12883      54.73      94.2      99.6                    99.27
    NA12884      43.76      93.2      99.7                    99.27
    NA12885      54.56      94        99.8                    99.28
    NA12886      64.98      91        99.8                    99.28
    NA12887      48.33      92.4      99.81                   99.29
    NA12888      47.61      92.2      99.81                   99.28
    NA12889      49.99      91        99.49                   99.28
    NA12890      59.34      88        99.8                    99.29
    NA12891      45.49      93        99.75                   99.28
    NA12892      50.32      93.4      99.67                   99.29
    NA12893      47.69      92.7      99.79                   99.28
 8
NA12882

                        Technical                   Technical
                       Replicate A                 Replicate B

                          200x                        200x
                        (18 lanes)                  (18 lanes)

                    100x         100x          100x          100x
                  (8 lanes)    (8 lanes)     (8 lanes)     (8 lanes)

                 50x   50x    50x    50x     50x    50x   50x    50x


    •   Callability and reproducibility among pairs of replicates
         –  50x vs 100x vs 200x
         –  Between technical replicates


9
Pair-wise comparisons of genome builds

     Concordance at variant positions where both genomes PASSed basic quality filters

            Coverage      Library       SNPs          Indels       Combined
                 50x      different     99.34%        90.94%       98.52%
                 50x      same          99.36%        90.83%       98.52%
                100x      different     99.47%        90.60%       98.57%
                100x      same          99.47%        90.54%       98.56%
                200x      different     99.53%        90.23%       98.55%




10
NA12882

                        Technical                   Technical
                       Replicate A                 Replicate B



                          200x                        200x
                        (18 lanes)                  (18 lanes)



                    100x         100x          100x          100x
                  (8 lanes)    (8 lanes)     (8 lanes)     (8 lanes)


                 50x   50x    50x    50x     50x    50x   50x    50x


 •   Consistency across all the replicates
      –  How many replicates could be called at a given position?
      –  How many different genotypes were present at that position?
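
The two questions above reduce to two counts per position. A minimal sketch, assuming the 14 builds (2 at 200x, 4 at 100x, 8 at 50x) are represented as a list of genotype strings with None where a build failed the quality filter; this representation is an assumption, and genotype strings are compared as-is without allele-order normalization.

```python
def summarize_position(calls):
    """Return (replicates called, distinct genotypes) for one position.

    `calls` holds one genotype string per replicate build, or None
    where that build did not PASS the genotype quality filter.
    """
    passing = [genotype for genotype in calls if genotype is not None]
    return len(passing), len(set(passing))


# 14 builds: twelve agree, one is filtered out, one disagrees.
calls = ["0/1"] * 12 + [None, "1/1"]
n_called, n_genotypes = summarize_position(calls)
print(n_called, n_genotypes)  # 13 2
```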


11
Consistency among technical replicates
Number of replicates PASSing genotype quality filter (rows) vs. number of different genotypes (columns); values in % of genome

Reps        0        1        2        3        4        5        6        7        8        9       10       11       12       13       14
   0     1.96
   1              0.23
   2              0.21   0.0005
   3              0.18   0.0006  3.5E-05
   4              0.16   0.0007  4.2E-05  8.7E-06
   5              0.15   0.0007  4.5E-05  1.3E-05  3.5E-06
   6              0.15   0.0008  4.6E-05  1.6E-05  6.1E-06  1.4E-06
   7              0.16   0.0008  4.9E-05  1.8E-05  8.8E-06  3.0E-06  8.2E-07
   8              0.16   0.0007  5.5E-05  1.9E-05  9.0E-06  4.3E-06  1.9E-06  4.1E-07
   9              0.17   0.0007  5.6E-05  2.0E-05  1.1E-05  5.2E-06  2.5E-06  1.4E-06  3.7E-07
  10              0.20   0.0006  6.1E-05  2.1E-05  1.1E-05  7.4E-06  3.8E-06  1.9E-06  7.1E-07  1.9E-07
  11              0.24   0.0006  6.9E-05  2.6E-05  1.4E-05  9.4E-06  6.4E-06  3.7E-06  1.5E-06  3.7E-07  7.4E-08
  12              0.32   0.0007  8.5E-05  3.2E-05  1.9E-05  1.2E-05  8.6E-06  5.5E-06  2.8E-06  1.3E-06  4.8E-07  7.4E-08
  13              0.61   0.0010  1.2E-04  4.3E-05  2.8E-05  1.9E-05  1.5E-05  1.1E-05  7.4E-06  4.6E-06  2.0E-06  6.7E-07  2.2E-07
  14             95.07   0.0025  2.3E-04  8.6E-05  5.3E-05  4.0E-05  3.6E-05  3.3E-05  3.0E-05  2.3E-05  1.4E-05  7.6E-06  2.1E-06  6.0E-07


    “Metal”    Genome    SNVs from a 50x build
    Gold       95.1%     94.80%    3,030,777
    Silver     2.95%      4.15%      132,579
    Copper     0.01%      1.05%       33,679
    Lead       1.96%
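
The "metal" classes appear to follow from the two counts in the matrix. The sketch below states the rule as inferred from the numbers rather than from an explicit definition on the slide: Gold matches the cell for 14 replicates with one genotype (95.07), Lead matches zero passing replicates (1.96), Copper is described on a later slide as positions with more than one genotype, and Silver is assumed to be the remainder (one genotype, called in 1 to 13 replicates).

```python
N_REPLICATES = 14  # 2 builds at 200x, 4 at 100x, 8 at 50x


def metal(n_passing, n_genotypes):
    """Classify a position from its replicate counts (inferred rule)."""
    if n_passing == 0:
        return "Lead"
    if n_genotypes > 1:
        return "Copper"
    if n_passing == N_REPLICATES:
        return "Gold"
    return "Silver"


# Consistency check of the Silver assumption: the single-genotype
# column for 1..13 passing replicates, copied from the matrix.
single_genotype = [0.23, 0.21, 0.18, 0.16, 0.15, 0.15, 0.16,
                   0.16, 0.17, 0.20, 0.24, 0.32, 0.61]
silver_total = sum(single_genotype)
print(f"{silver_total:.2f}")  # 2.94, close to the 2.95% reported for Silver
```

The small gap between 2.94 and 2.95 is consistent with rounding of the individual cells, which supports but does not prove the inferred definition.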
  
           12
Genomic features overlapping with “metal”
regions

                 Genome     SNVs       CDS        medCDS
    gold         95.07%     94.80%     96.91%     97.87%
    silver        2.95%      4.15%      1.35%      1.11%
    copper        0.01%      1.05%      0.003%     0.002%
    lead          1.96%      0.00%      1.74%      1.02%
  




13
A closer examination of “Copper” regions:
those that had more than one genotype
      86% of copper regions had just two different genotypes

                  Type of inconsistency      Percentage
                  REF / het SNV              37.40
                  REF / het DEL              21.89
                  REF / het INS              15.11
                  het SNV / hom SNV           5.38
                  het DEL / hom DEL           0.42
                  het INS / hom INS           1.43
                  Remaining                  18.38
  




14
Concordance in “metal” regions
                SNP concordance from two builds generated from different libraries

                              50x        100x       200x
                    ALL       99.34%     99.47%     99.53%
                    Gold      99.80%     99.94%     99.94%
                    Silver    85.00%     89.81%     93.80%
                    Copper    53.85%     67.85%     82.12%
                    Lead*     519        6,589      22,164

               Non-gold regions of the genome point to areas that
                 are not comprehensively/accurately assessed

* Absolute values more revealing
  15
Concordance in “metal” regions
     Concordance of variants between two 100x builds from the same library

                                SNPs       Indels     Both
                    Overall     99.47%     90.54%     98.56%
                    Gold        99.92%     96.77%     99.65%
                    Silver      90.65%     68.18%     86.32%
                    Copper      77.13%     57.11%     61.00%
                    Lead        73.44%     74.73%     73.88%


                      Indels need more attention



16
Practical/Clinical/Medical Relevance

                 200x build comparison in medically-relevant CDS regions

     Metal       ALL      Same     Different   Percent the Same   Percent in Metal
     Combined    1,187    1,182    5            99.58%
     Gold        1,151    1,151    0           100.00%            96.97%
     Silver         29       26    3            89.66%             2.44%
     Copper          2        2    0           100.00%             0.17%
     Lead            5        3    2            60.00%             0.42%
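
The two percentage columns can be reproduced from the counts. This sketch assumes "Percent the Same" is Same divided by ALL within each metal, and "Percent in Metal" is that metal's ALL divided by the Combined total; both readings are inferred from the numbers, and the variable names are illustrative.

```python
# metal: (ALL, Same) as shown in the table
counts = {
    "Gold": (1151, 1151),
    "Silver": (29, 26),
    "Copper": (2, 2),
    "Lead": (5, 3),
}

total = sum(n_all for n_all, _ in counts.values())
total_same = sum(n_same for _, n_same in counts.values())

pct_same = {m: 100.0 * same / n_all for m, (n_all, same) in counts.items()}
pct_in_metal = {m: 100.0 * n_all / total for m, (n_all, _) in counts.items()}

print(f"Combined {100.0 * total_same / total:.2f}%")
for name in counts:
    print(f"{name:<8} {pct_same[name]:6.2f}% {pct_in_metal[name]:6.2f}%")
```

The per-metal counts sum to the Combined row (1,187 and 1,182), and the derived percentages round to the values on the slide.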
  




17
Future Plans
 •   Classify inconsistent parts of the genome into:
      –  Alignment or read length issues
          §    Paralogous/repetitive/CNV regions
          §    Missed or wrong indel calls
      –  Depth of coverage
      –  Platform-specific artifacts

 •   Disseminate data/analyses to the research community
 •   Platform for developing better indel detection
 •   Error correction via haplotyping efforts
 •   Independent validation efforts
 •   Develop a database of variants and associated evidence

18
Acknowledgements
 •   David Bentley      •   Klaus Maisinger
 •   Sean Humphray      •   Russell Grocock
 •   Mark Ross          •   Peter Saffrey
 •   Nick Kerry         •   Brad Sickler
 •   Nondas Fritzilas   •   Pedro Cruz
 •   Phil Tedder        •   Shankar Ajay
 •   Mike Eberle        •   Marc Laurant
 •   Lisa Murray        •   Semyon Kruglyak




19
END




20
[Slide background: first page of the Genome Research article, with an overlaid table]

Accurate and comprehensive sequencing of personal genomes
Subramanian S. Ajay,1 Stephen C.J. Parker,1 Hatice Ozel Abaan,1 Karin V. Fuentes Fajardo,2 and Elliott H. Margulies1,3,4
Genome Res. published online July 19, 2011; doi:10.1101/gr.123638.111
Downloaded from genome.cshlp.org on July 20, 2011 - Published by Cold Spring Harbor Laboratory Press

1 Genome Informatics Section, Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; 2 Undiagnosed Diseases Program, Office of the Clinical Director, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA

As whole-genome sequencing becomes commoditized and we begin to sequence and analyze personal genomes for clinical and diagnostic purposes, it is necessary to understand what constitutes a complete sequencing experiment for determining genotypes and detecting single-nucleotide variants. Here, we show that the current recommendation of ~30x coverage is not adequate to produce genotype calls across a large fraction of the genome with acceptably low error rates. Our results are based on analyses of a clinical sample sequenced on two related Illumina platforms, GAIIx and HiSeq 2000, to a very high depth (126x). We used these data to establish genotype-calling filters that dramatically increase accuracy. We also empirically determined how the callable portion of the genome varies as a function of the amount of sequence data used. These results help provide a “sequencing guide” for future whole-genome sequencing decisions and metrics by which coverage statistics should be reported.
[Supplemental material is available for this article.]

Whole-genome sequencing and analysis is becoming part of a translational research toolkit (Lupski et al. 2010; Sobreira et al. 2010) to investigate small-scale changes such as single-nucleotide variants (SNVs) and indels (Bentley et al. 2008; Wang et al. 2008; Kim et al. 2009; McKernan et al. 2009; Fujimoto et al. 2010; Lee et al. 2010; Pleasance et al. 2010) in addition to large-scale events such as chromosomal rearrangements (Campbell et al. 2008; Chen et al. 2008) and copy-number variation (Chiang et al. 2009; Park et al. 2010). For both basic genome biology and clinical diagnostics, the trade-offs of data quality and quantity will determine what constitutes a “comprehensive and accurate” whole-genome analysis, especially for detecting SNVs. As whole-genome sequencing becomes commoditized, it will be important to determine quantitative metrics to assess and describe the comprehensiveness of an individual’s genome sequence. No such standards currently exist.

... a question that is extremely important as whole-genome sequencing and analysis of individual genomes transitions from primarily research-based projects to being used for clinical and diagnostic applications. Additionally, we seek to understand the relationship between the amount of sequence data generated and the resulting proportion of the genome where confident genotypes can be derived (we refer to this as the “callable” portion, a term that is roughly equivalent to the 1000 Genomes Project’s “accessible” portion). Using these sequencing metrics and genotype-calling filters will help obviate the need for costly and time-consuming validation efforts. Currently, no empirically derived data sets exist for determining how much sequence data is needed to enable accurate detection of SNVs.
To address this issue, we sequenced a blood sample from a male individual with an undiagnosed clinical condition on two related platforms (Illumina’s GAIIx and HiSeq 2000) to a total of

[Overlay] Genotype calls, 50x vs. 50x, hg19 callable

    Filter                                   In both     Discordant
    No extra filters                         98.33%      46,580
    With alignment and genotype Filters      93.13%       1,673
    No q20 Evidence (MapQ1)                                 267

NHGRI

21
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 

Elliott Margulies - Striving for Perfection: The Platinum Genomes Project

  • 1. Striving for Perfection: The Platinum Genomes Project Elliott H. Margulies, Ph.D. Director, Scientific Research COMPANY CONFIDENTIAL – DO NOT DISTRIBUTE © 2011 Illumina, Inc. All rights reserved. Illumina, illuminaDx, BeadArray, BeadXpress, cBot, CSPro, DASL, Eco, Genetic Energy, GAIIx, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, Sentrix, Solexa, TruSeq, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
  • 2. From Sample to Answer
      Workflow stages: Sample, Sequence, Analyse, Annotate, Interpret, Answer
      Enabling clinical use of WGS:
      - Fast sequencing from low-input and FFPE samples
      - Improved accuracy and utility of detected variants
      - Integrated “push button” analyses, from sequence to annotated variants
      - Focus on genome exploration
  • 3. The truth is hard to find…
      - Sequencing the same genome twice does not give you the identical answer (figure: variants called the first time vs. the second time).
      - We identify many more Mendelian conflicts than actually exist (figure: Dad A/A, Mom T/T, Child T/T).
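A Mendelian conflict of the kind drawn on slide 3 can be checked mechanically. The following is a minimal sketch, assuming diploid autosomal genotypes and no de novo mutation; the function name `is_mendelian_consistent` and the tuple representation of genotypes are illustrative choices, not taken from the slides.

```python
from itertools import product


def is_mendelian_consistent(dad, mom, child):
    """Return True if the child's genotype can be formed by taking
    one allele from each parent (diploid, autosomal, no de novo mutation)."""
    child_sorted = sorted(child)
    # Try every (paternal allele, maternal allele) combination.
    return any(sorted(pair) == child_sorted for pair in product(dad, mom))


# The trio from the slide: Dad A/A, Mom T/T, Child T/T.
slide_trio_ok = is_mendelian_consistent(("A", "A"), ("T", "T"), ("T", "T"))
# A child genotype these parents could actually produce.
het_child_ok = is_mendelian_consistent(("A", "A"), ("T", "T"), ("A", "T"))

print(slide_trio_ok)  # False: the child cannot inherit a T from an A/A father
print(het_child_ok)   # True
```

A call set in which such conflicts are far more frequent than real de novo events would explain points to genotyping errors, which is the slide's argument.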
  • 4. Summary of increased accuracy
      Eland+CASAVA:
        Filter             Sensitivity   Mendelian conflicts   Accuracy
        unfiltered         96.62         13,032                99.9995%
        + gVCF filters     96.10         8,383                 99.9997%
        + score:coverage   95.25         5,309                 99.9998%
      Net effect of filtering: 1.43% loss in sensitivity, 59.26% loss in conflicts.
        Method             Sensitivity   Conflicts             Accuracy
        BWA+MPG*           95.90         4,928                 99.9998%
      NB: Accuracy is expressed here as % of total filtered calls that are Mendelian concordant.
      * Accurate and comprehensive sequencing of personal genomes. S.S. Ajay, S.C.J. Parker, H. Ozel Abaan, Karin V. Fuentes Fajardo, and E.H. Margulies. Genome Res. 2011 21: 1498-1505
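The two "loss" figures on slide 4 read as relative changes between the unfiltered row and the most heavily filtered row. The sketch below works that arithmetic through under that assumption; `percent_loss` is an illustrative helper name. With the rounded table values, the conflict reduction reproduces the slide's 59.26%, while the sensitivity loss comes out near 1.42%, marginally below the slide's 1.43%, which was presumably computed from unrounded sensitivities.

```python
def percent_loss(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Values copied from the slide: unfiltered row vs. "+ score:coverage" row.
conflicts_loss = percent_loss(13032, 5309)
sensitivity_loss = percent_loss(96.62, 95.25)

print(f"Loss in Mendelian conflicts: {conflicts_loss:.2f}%")
print(f"Loss in sensitivity:         {sensitivity_loss:.2f}%")
```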
  • 5. A critical assessment of whole-genome sequencing…
      - Where are we doing well?
      - What parts of the genome are still inaccessible or less accurately called, and most importantly, why?
      GOALS:
      - Maximum utility for use in research and medical applications
      - Determine key areas for improvement and assess progress
      - Assess performance in real-life situations
  • 6. Platinum genomes: the proposal
      - Select a small set of well-known and accessible genomes
      - Generate initial WGS datasets using best current practices
      - Make them freely available in a database following "open source" principles
      - Perform analyses to define high- and low-quality regions and variant calls
      - Examine low-quality regions and calls, and validate them with additional evidence (methods)
      - Maintain a database with revised data and evidence to provide a long-term benchmark
      - Develop improved methods (analysis, chemistry, sample prep)
  • 7. CEPH/Utah Pedigree 1463
      Pedigree figure (sample IDs):
        Grandparents: 12889, 12890, 12891, 12892
        Parents: 12877, 12878
        Children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893
      - Three-generation family, extensively sequenced by the genomics community
      - Focus on the trio shaded in gray (12877, 12878 and 12882)
      - Sourced ~200 µg for the initial trio (shaded) and ~50 µg for all others
  • 8. Initial dataset
        Sample    Depth    Q30    Genotype coverage   Genotype concordance
        NA12877   219.63   91.3   99.79               99.25
        NA12878   211.88   93.6   99.8                99.25
        NA12882   217.95   93.2   99.8                99.24
        NA12881   46.67    91.7   99.84               99.28
        NA12880   48.37    91.4   99.74               99.28
        NA12879   48.01    92     99.75               99.29
        NA12883   54.73    94.2   99.6                99.27
        NA12884   43.76    93.2   99.7                99.27
        NA12885   54.56    94     99.8                99.28
        NA12886   64.98    91     99.8                99.28
        NA12887   48.33    92.4   99.81               99.29
        NA12888   47.61    92.2   99.81               99.28
        NA12889   49.99    91     99.49               99.28
        NA12890   59.34    88     99.8                99.29
        NA12891   45.49    93     99.75               99.28
        NA12892   50.32    93.4   99.67               99.29
        NA12893   47.69    92.7   99.79               99.28
      The slide carries a "Technical Replicate" annotation beside the NA12882 row.
  • 9. NA12882
      Replicate design (figure):
        Technical Replicate A: one 200x build (18 lanes); below it, two 100x builds (8 lanes each); below those, four 50x builds
        Technical Replicate B: one 200x build (18 lanes); below it, two 100x builds (8 lanes each); below those, four 50x builds
      - Callability and reproducibility among pairs of replicates
          - 50x vs 100x vs 200x
          - Between technical replicates
  • 10. Pair-wise comparisons of genome builds
      Concordance at variant positions where both genomes PASSed basic quality filters:
        Coverage   Library     SNPs     Indels   Combined
        50x        different   99.34%   90.94%   98.52%
        50x        same        99.36%   90.83%   98.52%
        100x       different   99.47%   90.60%   98.57%
        100x       same        99.47%   90.54%   98.56%
        200x       different   99.53%   90.23%   98.55%
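The concordance metric on slide 10 can be sketched as follows. This is a toy illustration under stated assumptions, not the pipeline used for the slides: each build is modelled as a dictionary from a site to a `(genotype, passed_filter)` pair, and the data structure, the name `concordance`, and the example sites are all hypothetical.

```python
def concordance(calls_a, calls_b):
    """Percent of sites with identical genotypes, restricted to sites
    present in both builds where both calls PASSed their filters.

    calls_a, calls_b: dict mapping (chrom, pos) -> (genotype, passed_filter).
    Returns None when no site qualifies.
    """
    shared = [
        site
        for site in calls_a.keys() & calls_b.keys()
        if calls_a[site][1] and calls_b[site][1]
    ]
    if not shared:
        return None
    same = sum(calls_a[site][0] == calls_b[site][0] for site in shared)
    return 100.0 * same / len(shared)


# Hypothetical toy builds, invented for illustration only.
build_a = {
    ("chr1", 100): ("A/T", True),
    ("chr1", 200): ("C/C", True),
    ("chr1", 300): ("G/T", True),
    ("chr1", 400): ("A/A", False),  # fails the filter in build A, so excluded
}
build_b = {
    ("chr1", 100): ("A/T", True),
    ("chr1", 200): ("C/T", True),   # genotype disagrees with build A
    ("chr1", 300): ("G/T", True),
    ("chr1", 400): ("A/A", True),
    ("chr1", 500): ("T/T", True),   # absent from build A, so excluded
}

result = concordance(build_a, build_b)
print(f"{result:.2f}%")  # 2 of 3 qualifying sites agree
```

Restricting the denominator to sites that pass in both builds means the metric measures agreement among confident calls and says nothing about how many sites were callable in the first place, which is why the deck treats callability separately.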
  • 11. NA12882
      Replicate design (figure, repeated from slide 9): Technical Replicates A and B, each with one 200x build (18 lanes), two 100x builds (8 lanes each) and four 50x builds
      - Consistency across all the replicates
          - How many replicates were able to be called at a given position?
          - How many different genotypes were present at that position?
  • 12. Consistency among technical replicates
      Percent of the genome, by number of replicates PASSing the genotype quality filter (rows) and number of different genotypes observed (values in each row run left to right for 1, 2, 3, … different genotypes):
         0 replicates: 1.96 (0 genotypes)
         1 replicate:  0.23
         2 replicates: 0.21, 0.0005
         3 replicates: 0.18, 0.0006, 3.5E-05
         4 replicates: 0.16, 0.0007, 4.2E-05, 8.7E-06
         5 replicates: 0.15, 0.0007, 4.5E-05, 1.3E-05, 3.5E-06
         6 replicates: 0.15, 0.0008, 4.6E-05, 1.6E-05, 6.1E-06, 1.4E-06
         7 replicates: 0.16, 0.0008, 4.9E-05, 1.8E-05, 8.8E-06, 3.0E-06, 8.2E-07
         8 replicates: 0.16, 0.0007, 5.5E-05, 1.9E-05, 9.0E-06, 4.3E-06, 1.9E-06, 4.1E-07
         9 replicates: 0.17, 0.0007, 5.6E-05, 2.0E-05, 1.1E-05, 5.2E-06, 2.5E-06, 1.4E-06, 3.7E-07
        10 replicates: 0.20, 0.0006, 6.1E-05, 2.1E-05, 1.1E-05, 7.4E-06, 3.8E-06, 1.9E-06, 7.1E-07, 1.9E-07
        11 replicates: 0.24, 0.0006, 6.9E-05, 2.6E-05, 1.4E-05, 9.4E-06, 6.4E-06, 3.7E-06, 1.5E-06, 3.7E-07, 7.4E-08
        12 replicates: 0.32, 0.0007, 8.5E-05, 3.2E-05, 1.9E-05, 1.2E-05, 8.6E-06, 5.5E-06, 2.8E-06, 1.3E-06, 4.8E-07, 7.4E-08
        13 replicates: 0.61, 0.0010, 1.2E-04, 4.3E-05, 2.8E-05, 1.9E-05, 1.5E-05, 1.1E-05, 7.4E-06, 4.6E-06, 2.0E-06, 6.7E-07, 2.2E-07
        14 replicates: 95.07, 0.0025, 2.3E-04, 8.6E-05, 5.3E-05, 4.0E-05, 3.6E-05, 3.3E-05, 3.0E-05, 2.3E-05, 1.4E-05, 7.6E-06, 2.1E-06, 6.0E-07
      “Metal” summary:
        Metal    Genome   SNVs from a 50x build (%)   SNVs from a 50x build (count)
        Gold     95.1%    94.80%                      3,030,777
        Silver   2.95%    4.15%                       132,579
        Copper   0.01%    1.05%                       33,679
        Lead     1.96%
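The transcript does not spell out how the four "metal" classes are derived from the replicate table; only copper is defined in words (slide 14: more than one genotype). The sketch below assumes the remaining definitions that the numbers suggest: lead when no replicate passes, gold when all 14 replicates pass with a single genotype, and silver when 1 to 13 replicates pass with a single genotype. The function name and signature are illustrative. As a consistency check, the single-genotype values for 1 to 13 passing replicates sum to about 2.94, close to the 2.95% the slide reports for silver.

```python
def classify_position(n_passing, n_genotypes, n_replicates=14):
    """Assign a 'metal' class to a genomic position (assumed definitions).

    n_passing:   replicates whose call PASSed the genotype quality filter
    n_genotypes: distinct genotypes among the passing calls
    """
    if n_passing == 0:
        return "lead"    # never callable
    if n_genotypes > 1:
        return "copper"  # replicates disagree with one another
    if n_passing == n_replicates:
        return "gold"    # callable everywhere, always the same genotype
    return "silver"      # consistent genotype, but not always callable


# Single-genotype percentages for 1..13 passing replicates, from the slide.
single_genotype_partial = [
    0.23, 0.21, 0.18, 0.16, 0.15, 0.15, 0.16,
    0.16, 0.17, 0.20, 0.24, 0.32, 0.61,
]
silver_total = sum(single_genotype_partial)

print(classify_position(14, 1))  # gold
print(classify_position(9, 1))   # silver
print(classify_position(14, 2))  # copper
print(classify_position(0, 0))   # lead
print(f"{silver_total:.2f}")     # compare with the slide's 2.95% for silver
```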
  • 13. Genomic features overlapping with “metal” regions
        Metal    Genome   SNVs     CDS      medCDS
        gold     95.07%   94.80%   96.91%   97.87%
        silver   2.95%    4.15%    1.35%    1.11%
        copper   0.01%    1.05%    0.003%   0.002%
        lead     1.96%    0.00%    1.74%    1.02%
  • 14. A closer examination of “Copper” regions: those that had more than one genotype
      86% of copper regions had just two different genotypes.
        Type of inconsistency   Percentage
        REF / het SNV           37.40
        REF / het DEL           21.89
        REF / het INS           15.11
        het SNV / hom SNV       5.38
        het DEL / hom DEL       0.42
        het INS / hom INS       1.43
        Remaining               18.38
  • 15. Concordance in “metal” regions
      SNP concordance from two builds generated from different libraries:
        Region   50x      100x     200x
        ALL      99.34%   99.47%   99.53%
        Gold     99.80%   99.94%   99.94%
        Silver   85.00%   89.81%   93.80%
        Copper   53.85%   67.85%   82.12%
        Lead*    519      6,589    22,164
      * Absolute values more revealing
      Non-gold regions of the genome point to areas that are not comprehensively/accurately assessed.
  • 16. Concordance in “metal” regions
      Concordance of variants between two 100x builds from the same library:
        Region    SNPs     Indels   Both
        Overall   99.47%   90.54%   98.56%
        Gold      99.92%   96.77%   99.65%
        Silver    90.65%   68.18%   86.32%
        Copper    77.13%   57.11%   61.00%
        Lead      73.44%   74.73%   73.88%
      Indels need more attention.
  • 17. Practical/Clinical/Medical Relevance
      200x build comparison in medically-relevant CDS regions:
        Metal      ALL     Same    Different   Percent the same   Percent in metal
        Combined   1,187   1,182   5           99.58%
        Gold       1,151   1,151   0           100.00%            96.97%
        Silver     29      26      3           89.66%             2.44%
        Copper     2       2       0           100.00%            0.17%
        Lead       5       3       2           60.00%             0.42%
  • 18. Future Plans
      - Classify inconsistent parts of the genome into:
          - Alignment or read length issues
              - Paralogous/repetitive/CNV regions
              - Missed or wrong indel calls
          - Depth of coverage
          - Platform-specific artifacts
      - Disseminate data/analyses to the research community
      - Platform for developing better indel detection
      - Error correction via haplotyping efforts
      - Independent validation efforts
      - Develop a database of variants and associated evidence
  • 19. Acknowledgements
      David Bentley, Klaus Maisinger, Sean Humphray, Russell Grocock, Mark Ross, Peter Saffrey, Nick Kerry, Brad Sickler, Nondas Fritzilas, Pedro Cruz, Phil Tedder, Shankar Ajay, Mike Eberle, Marc Laurant, Lisa Murray, Semyon Kruglyak
  • 21. Accurate and comprehensive sequencing of personal genomes (first page of the article, with a slide overlay)
      Subramanian S. Ajay (1), Stephen C.J. Parker (1), Hatice Ozel Abaan (1), Karin V. Fuentes Fajardo (2), and Elliott H. Margulies (1,3,4)
      Genome Res., published online July 19, 2011, in advance of the print journal. Cold Spring Harbor Laboratory Press. doi:10.1101/gr.123638.111. Freely available online through the Genome Research Open Access option. Supplemental Material: http://genome.cshlp.org/content/suppl/2011/06…
      (1) Genome Informatics Section, Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; (2) Undiagnosed Diseases Program, Office of the Clinical Director, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA

      Slide overlay (NHGRI): genotype calls, 50x vs. 50x
        Filter                                hg19 callable in both   Discordant
        No extra filters                      98.33%                  46,580
        With alignment and genotype filters   93.13%                  1,673
        No q20 evidence (MapQ1)                                       267

      Abstract: As whole-genome sequencing becomes commoditized and we begin to sequence and analyze personal genomes for clinical and diagnostic purposes, it is necessary to understand what constitutes a complete sequencing experiment for determining genotypes and detecting single-nucleotide variants. Here, we show that the current recommendation of ~30× coverage is not adequate to produce genotype calls across a large fraction of the genome with acceptably low error rates. Our results are based on analyses of a clinical sample sequenced on two related Illumina platforms, GAIIx and HiSeq 2000, to a very high depth (126×). We used these data to establish genotype-calling filters that dramatically increase accuracy. We also empirically determined how the callable portion of the genome varies as a function of the amount of sequence data used. These results help provide a “sequencing guide” for future whole-genome sequencing decisions and metrics by which coverage statistics should be reported. [Supplemental material is available for this article.]

      Whole-genome sequencing and analysis is becoming part of a translational research toolkit (Lupski et al. 2010; Sobreira et al. 2010) to investigate small-scale changes such as single-nucleotide variants (SNVs) and indels (Bentley et al. 2008; Wang et al. 2008; Kim et al. 2009; McKernan et al. 2009; Fujimoto et al. 2010; Lee et al. 2010; Pleasance et al. 2010) in addition to large-scale events such as chromosomal rearrangements (Campbell et al. 2008; Chen et al. 2008) and copy-number variation (Chiang et al. 2009; Park et al. 2010). For both basic genome biology and clinical diagnostics, the trade-offs of data quality and quantity will determine what constitutes a “comprehensive and accurate” whole-genome analysis, especially for detecting SNVs. As whole-genome sequencing becomes commoditized, it will be important to determine quantitative metrics to assess and describe the comprehensiveness of an individual’s genome sequence. No such standards currently exist.

      … a question that is extremely important as whole-genome sequencing and analysis of individual genomes transitions from primarily research-based projects to being used for clinical and diagnostic applications. Additionally, we seek to understand the relationship between the amount of sequence data generated and the resulting proportion of the genome where confident genotypes can be derived; we refer to this as the “callable” portion, a term that is roughly equivalent to the 1000 Genomes Project’s “accessible” portion. Using these sequencing metrics and genotype-calling filters will help obviate the need for costly and time-consuming validation efforts. Currently, no empirically derived data sets exist for determining how much sequence data is needed to enable accurate detection of SNVs. To address this issue, we sequenced a blood sample from a male individual with an undiagnosed clinical condition on two related platforms (Illumina’s GAIIx and HiSeq 2000) to a total of