2016 ASHG GIAB poster

Unprecedented characterization of a human trio for new genomic reference materials
Justin Zook1, Marc Salit1, and the Genome in a Bottle Consortium
(1) Genome-Scale Measurements Group, National Institute of Standards and Technology, Gaithersburg, MD and Stanford, CA
Introduction
•  NIST hosts the Genome in a Bottle (GIAB) Consortium to develop well-characterized, whole human genome reference samples that are an enduring resource for benchmarking variant calls
•  Large batches of DNA from cell lines for these genomes are distributed as NIST Reference Materials (RMs) with extensive public data
•  High-confidence small variant benchmark calls covering 88-90% of the reference have been released for the 5 genomes distributed as NIST RMs
•  Currently developing benchmark calls for more difficult variants (e.g., larger indels, SVs) and in more difficult regions of the genome
•  GIAB has made a large, diverse set of data for 2 trios public to stimulate a community effort to characterize challenging variants and regions
•  We have formed an open, public analysis team to coordinate characterization efforts (e.g., collecting and evaluating SV calls from different methods, manually curating calls, and integrating calls)
Genomic Reference Materials and Data
Figure 1: Updated, simplified v3 integration process to form high-confidence SNPs, indels, and homozygous reference regions for all GIAB genomes. v3.3 incorporates 10X Genomics data to call difficult-to-map regions and GATK's gVCF output to call repeats.
Discussion/Future Work
•  New genomes: additional ancestries, tumor/normal genomes
•  Other analyses: methylation, phasing, STRs, difficult-to-map regions, chrY
•  What rules should be used for adding challenging high-confidence calls?
•  What performance metrics should be used when benchmarking SV accuracy?
•  Data described at: https://github.com/genome-in-a-bottle
•  New collaborations to characterize difficult regions and variants in these
genomes are welcome! Email jzook@nist.gov if you’re interested
Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Chinese Son | hu91BD69 | GM24631 | HG005 | RM8393
Chinese Father | huCA017E | GM24694 | N/A | N/A
Chinese Mother | hu38168C | GM24695 | N/A | N/A
Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; 40-50x/individual | SRA/FTP | SNPs/indels/some SVs
Complete Genomics |  | 100x/individual |  | some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | SRA/FTP | SNPs
Illumina WES | 100x100bp | ~300x/individual | SRA/FTP | SNPs/indels in exome
Ion Proton | Exome | 1000x/individual | SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | SRA/FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | FTP | SVs/phasing/assembly
Complete Genomics | LFR | 100x/individual |  | phasing
10X | Linked reads | 45-75x/individual | FTP | mapping/phasing/SVs/assembly
Dovetail | Chicago | ~50x/AJ indiv | FTP | scaffolding
PacBio | ~10kb reads | ~70x on AJ son, ~30x on AJ parents | SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.02x on AJ son | FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 maps | 70x on AJ son | Collaborations | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on HG005 | FTP | SVs/assembly
Data type legend: Long-range; WGS; WES; Long reads; Mapping; Paired-end WGS
De novo assemblies for AJ Son
SNPs, indels, and homozygous reference calls
Data | Method | Contig N50 | Scaffold N50 | Number Scaffolds | Total Size
PacBio | Falcon | 5.3 Mb | 5.3 Mb | 13231 | 3.04 Gb
PacBio | PBcR | 4.5 Mb | 4.5 Mb | 12523 | 2.99 Gb
PacBio+BioNano | Falcon+BioNano | 4.1 Mb | 22.7 Mb | 478 | 2.38 Gb
PacBio+Dovetail | Falcon+HiRise | 5.3 Mb | 12.9 Mb | 12459 | 3.04 Gb
PacBio+Dovetail | PBcR+HiRise | 4.1 Mb | 20.6 Mb | 10491 | 2.99 Gb
Illumina | DISCOVAR | 81 kb | 149 kb | 1.06M | 3.13 Gb
Illumina+Dovetail | DISCOVAR+HiRise | 85 kb | 12.9 Mb | 1.03M | 3.15 Gb
10X | Supernova | 106 kb | 15.2 Mb | 1360 | 2.73 Gb
1.  Find sensitive variant calls and callable regions for each dataset
2.  Find “consensus” calls with support from 2+ technologies (and no other technologies disagree)
3.  Use “consensus” calls to train one-class model for each dataset and find “outliers” that are less trustworthy for each dataset
4.  Find high-confidence calls by using callable regions and “outliers” to arbitrate between datasets when they disagree
5.  Find high-confidence regions by taking union of callable regions and subtracting uncertain variants and difficult regions
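The final step of the integration process (union of callable regions minus uncertain variants and difficult regions) is interval arithmetic on BED-like regions. A minimal sketch of that arithmetic follows. It assumes half-open (start, end) intervals on a single chromosome; the function names and the idea of passing uncertain variants and difficult regions as one `excluded` list are illustrative assumptions, not the GIAB implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Union of possibly overlapping intervals, returned sorted and disjoint."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals(keep: List[Interval], remove: List[Interval]) -> List[Interval]:
    """Remove every base covered by `remove` from `keep`."""
    result: List[Interval] = []
    remove = merge_intervals(remove)
    for start, end in merge_intervals(keep):
        cursor = start
        for r_start, r_end in remove:
            if r_end <= cursor:
                continue  # excluded interval lies entirely before the cursor
            if r_start >= end:
                break  # excluded interval lies entirely after this region
            if r_start > cursor:
                result.append((cursor, r_start))
            cursor = max(cursor, r_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


def high_confidence_regions(
    callable_by_dataset: List[List[Interval]],
    excluded: List[Interval],
) -> List[Interval]:
    """Union of per-dataset callable regions minus excluded intervals."""
    all_callable = [iv for regions in callable_by_dataset for iv in regions]
    return subtract_intervals(all_callable, excluded)
```

For example, `high_confidence_regions([[(0, 100), (200, 300)], [(50, 150)]], [(90, 110), (250, 260)])` first merges the callable regions into (0, 150) and (200, 300), then cuts out the two excluded intervals. In practice this kind of operation is usually done with a BED toolkit across all chromosomes rather than hand-written code.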
Table 1: Genomes currently being characterized by GIAB
Table 2: Data collected from AJ and/or Chinese trios
Credits for assemblies:
Ali Bashir, Mt. Sinai
Jason Chin, PacBio
Alex Hastie, BioNano
Serge Koren, NHGRI
Adam Phillippy, NHGRI
Kareina Dill, Dovetail
Noushin Ghaffari, TAMU
10X Genomics
Zook et al., Scientific Data, 2016.
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data
•  Discovery: “sequence-resolved” calls; imprecise SV calls
•  Comparison: sequence-based comparison; comparison of all candidate calls (SURVIVOR/svcompare)
•  Corroboration: SV corroboration methods (e.g., parliament, svviz, nabsys, bionano)
•  Benchmark calls: heuristics to form tiers of benchmark SVs; machine learning to form benchmark SVs
•  SV refinement? (e.g., MetaSV, parliament, PBRefine)
•  Paper about calls and comparisons in ~Nov?
Structural Variants
Calls | HC Regions | HC Calls | Concordant with PG | NIST-only in beds | PG-only in beds | PG-only
v2.19 | 2.22 Gb | 3153247 | 3030703 | 87 | 404 | 1018795
v3.1 | 2.55 Gb | 3453085 | 3330275 | 71 | 82 | 719223
v3.2.2 | 2.53 Gb | 3512990 | 3391783 | 57 | 52 | 657715
v3.3 | 2.57 Gb | 3566076 | 3441361 | 40 | 60 | 608137
Proposed new integration process
Proposed tiers of benchmark calls:
1.  2+ techs agree on exact sequence of SV and corroboration methods don’t disprove
2.  2+ techs agree on 9x% of the sequence of SV and corroboration methods don’t disprove
3.  1 tech is sequence-resolved and at least one other tech corroborates
4.  No sequence-resolved methods but corroborated by 2+ techs
5.  Questionable variants
6.  Likely non-SV regions
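The first four proposed tiers can be read as a decision cascade over per-call evidence counts. The sketch below is one hypothetical encoding of that reading. The `SVEvidence` fields and `assign_tier` function are invented for illustration, the similarity cutoff behind "9x%" is left to whatever upstream step fills `near_sequence_techs`, and treating any disproved call as untiered is an assumption. Tiers 5 and 6 are not assigned because the poster gives no criteria for them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SVEvidence:
    """Hypothetical summary of the evidence for one candidate SV."""
    exact_sequence_techs: int     # technologies agreeing on the exact SV sequence
    near_sequence_techs: int      # technologies agreeing on most (9x%) of the sequence
    sequence_resolved_techs: int  # technologies with any sequence-resolved call
    corroborating_techs: int      # other technologies that corroborate the call
    disproved: bool               # True if a corroboration method contradicts the call


def assign_tier(ev: SVEvidence) -> Optional[int]:
    """Return tier 1-4, or None when the call needs further review."""
    if ev.disproved:
        # Assumption: a disproved call is never placed in tiers 1-4.
        return None
    if ev.exact_sequence_techs >= 2:
        return 1
    if ev.near_sequence_techs >= 2:
        return 2
    if ev.sequence_resolved_techs >= 1 and ev.corroborating_techs >= 1:
        return 3
    if ev.sequence_resolved_techs == 0 and ev.corroborating_techs >= 2:
        return 4
    # Tiers 5 (questionable) and 6 (likely non-SV) need criteria
    # that are not specified here.
    return None
```

Checking the tiers in order means a call always lands in the strongest tier its evidence supports.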
New calls for GRCh38 on FTP!
•  Merge deletions within 1kb
•  Rank calls by closeness of predicted size to median size and select call in each region from best callset
•  Find calls supported by 2+ techs with size within 20%
•  Filter calls overlapping seg dups, reference N’s, or with call with predicted size 2x larger
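The merging, size-agreement, and ranking steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pipeline that produced the draft calls. The `Deletion` class and function names are hypothetical, "within 1kb" is interpreted as the gap between a call's start and the end of the running cluster, "within 20%" is measured against the cluster's median size, and the representative is the single call closest to the median rather than a call from a ranked callset. The filtering step (seg dups, reference N's, larger overlapping calls) is omitted.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Deletion:
    start: int
    end: int
    tech: str  # technology or callset that produced the call

    @property
    def size(self) -> int:
        return self.end - self.start


def cluster_deletions(calls: List[Deletion], max_gap: int = 1000) -> List[List[Deletion]]:
    """Group calls on one chromosome that lie within max_gap of each other."""
    clusters: List[List[Deletion]] = []
    cluster_end = 0
    for call in sorted(calls, key=lambda c: (c.start, c.end)):
        if clusters and call.start - cluster_end <= max_gap:
            clusters[-1].append(call)
            cluster_end = max(cluster_end, call.end)
        else:
            clusters.append([call])
            cluster_end = call.end
    return clusters


def pick_representative(
    cluster: List[Deletion], size_tolerance: float = 0.20
) -> Optional[Deletion]:
    """Return the call closest to the median size if 2+ techs agree on size."""
    med = median(c.size for c in cluster)
    agreeing = [c for c in cluster if abs(c.size - med) <= size_tolerance * med]
    if len({c.tech for c in agreeing}) < 2:
        return None  # not supported by two technologies with similar size
    return min(agreeing, key=lambda c: abs(c.size - med))


calls = [
    Deletion(1000, 1500, "pacbio"),
    Deletion(1010, 1490, "illumina"),
    Deletion(1900, 2000, "10x"),
    Deletion(10000, 10300, "pacbio"),
]
clusters = cluster_deletions(calls)
selected = [pick_representative(c) for c in clusters]
```

In this toy input the first three calls fall into one cluster; the two roughly 500 bp calls agree within 20% of the median and one of them is selected, while the isolated single-technology call is dropped.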
Preliminary deletion integration process
Deletion size | Pre-filtered calls | Post-filtered calls
<50bp | 2627 | 2548
50-100bp | 1600 | 1448
100-1000bp | 2306 | 1996
1kb-3kb | 385 | 297
>3kb | 389 | 262
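To see how the filter affects each size bin, the counts in the table above can be turned into retention fractions. The snippet below is a small worked calculation using only those counts; the variable and function names are illustrative.

```python
# (pre-filtered, post-filtered) deletion counts per size bin, from the table above
counts = {
    "<50bp": (2627, 2548),
    "50-100bp": (1600, 1448),
    "100-1000bp": (2306, 1996),
    "1kb-3kb": (385, 297),
    ">3kb": (389, 262),
}


def retained_fraction(pre: int, post: int) -> float:
    """Fraction of pre-filtered calls that survive filtering."""
    return post / pre


for size_bin, (pre, post) in counts.items():
    print(f"{size_bin:>10}: {post}/{pre} retained ({retained_fraction(pre, post):.1%})")

total_pre = sum(pre for pre, _ in counts.values())
total_post = sum(post for _, post in counts.values())
print(f"{'all':>10}: {total_post}/{total_pre} retained "
      f"({retained_fraction(total_pre, total_post):.1%})")
```

The retained fraction falls steadily with deletion size, so the filters remove proportionally more of the largest calls.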
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_DraftIntegratedDeletionsgt19bp_v0.1.8
Standardized benchmarking tools and bed files of difficult regions from GA4GH:
https://github.com/ga4gh/benchmarking-tools/
Assembly-based SV callers:
MSPAC
Assemblytics
PBRefine
IMPORTANT NOTE: These are draft assemblies and not intended for comparing methods.
