Sept2016 smallvar nist intro
1. Genome in a Bottle Workshop
Small Variant Data Jamboree
Justin Zook and Marc Salit
NIST Genome-Scale Measurements Group
September 15, 2016
2. Integration Methods to Establish Reference Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
4. New calls (v3.3) vs. old calls (v2.19)
V3.3
• 3441361 match PG (Platinum Genomes)
• 550982 PG calls outside high conf
• 124715 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
– 40 calls not in PG
– 60 extra PG calls
V2.19
• 3030717 match PG
• 1018795 PG calls outside high conf
• 122359 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
– 87 calls not in PG
– 404 extra PG calls
5. New calls (v3.3) vs. old calls (v2.19)
More high-confidence calls match Platinum Genomes
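The claim can be checked against the slide's own counts. The sketch below is only an illustration: it assumes that "match PG" plus "PG calls outside high conf" approximates the total number of PG calls considered for each version, which the slide does not state explicitly. The dictionary keys and the `matched_fraction` helper are hypothetical names.

```python
# Counts copied from the v3.3 vs. v2.19 comparison (PG = Platinum Genomes).
counts = {
    "v3.3": {"match_pg": 3441361, "pg_outside_high_conf": 550982},
    "v2.19": {"match_pg": 3030717, "pg_outside_high_conf": 1018795},
}

def matched_fraction(c):
    # Assumption: matched + outside-high-confidence ~ all PG calls considered.
    total = c["match_pg"] + c["pg_outside_high_conf"]
    return c["match_pg"] / total

# Additional PG-matching calls gained by v3.3 over v2.19.
extra_matches = counts["v3.3"]["match_pg"] - counts["v2.19"]["match_pg"]
fractions = {version: matched_fraction(c) for version, c in counts.items()}

print(f"v3.3 adds {extra_matches} PG-matching calls")
for version, frac in fractions.items():
    print(f"{version}: {frac:.1%} of PG calls fall inside high-confidence calls")
```

Under that assumption, roughly 86% of PG calls are matched by v3.3 versus roughly 75% by v2.19.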
6. New calls (v3.3) vs. old calls (v2.19)
Similar extra calls not in Platinum Genomes
7. New calls (v3.3) vs. old calls (v2.19)
~80% fewer differences from PG in high confidence regions
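The ~80% figure appears to follow from the counts remaining after low-confidence regions and regions around filtered PG calls are excluded. A minimal sketch of that arithmetic, assuming "differences" means the sum of "calls not in PG" and "extra PG calls" (the variable names are illustrative):

```python
# Counts remaining after exclusions, copied from the v3.3 vs. v2.19 comparison.
counts = {
    "v3.3": {"not_in_pg": 40, "extra_pg": 60},
    "v2.19": {"not_in_pg": 87, "extra_pg": 404},
}

def total_differences(c):
    # Assumption: a "difference" is any call present in only one of the two sets.
    return c["not_in_pg"] + c["extra_pg"]

new = total_differences(counts["v3.3"])
old = total_differences(counts["v2.19"])
reduction = (old - new) / old

print(f"{old} -> {new} differences, {reduction:.1%} fewer")
# prints "491 -> 100 differences, 79.6% fewer"
```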
8. New calls (v3.3) vs. old calls (v2.19)
Example VCF (Verily), stratified
V3.3
• 17% of SNPs not assessed
– 23% of SNPs in RefSeq coding
– 53% of SNPs in “bad promoters”
• 78% of indels not assessed
– 0.7% difference rate
• 17% FP in regions homologous to decoy
V2.19
• 27% of SNPs not assessed
– 36% of SNPs in RefSeq coding
– 82% of SNPs in “bad promoters”
• 78% of indels not assessed
– 1.2% difference rate
• 0.2% FP in regions homologous to decoy
9. Principles of Integration Process
• Form sensitive variant calls from each dataset
• Define “callable regions” for each callset
• Filter calls from each method with annotations unlike concordant calls
• Compare high-confidence calls to other callsets and manually inspect subset of differences
– vs. pedigree-based calls
– vs. common pipelines
– Trio analysis
• When benchmarking a new callset against ours, most putative FPs/FNs should actually be FPs/FNs
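The first two principles can be illustrated with a toy sketch. This is not the NIST integration pipeline: it only keeps sites where every callset that is callable at the site reports the same genotype, and it omits the annotation-based filtering and arbitration steps entirely. The function name `integrate`, the `min_support` parameter, the callset names, and the rule that a callable callset with no call counts as homozygous reference (`0/0`) are all assumptions made for illustration.

```python
from collections import Counter

def integrate(callsets, min_support=2):
    """Return {site: genotype} for sites where all callable callsets agree.

    callsets maps a name to {"calls": {site: genotype}, "callable": set of sites}.
    A callset that is callable at a site but has no call there votes "0/0".
    """
    sites = set()
    for cs in callsets.values():
        sites |= set(cs["calls"])

    high_conf = {}
    for site in sorted(sites):
        votes = Counter()
        for cs in callsets.values():
            if site in cs["callable"]:
                votes[cs["calls"].get(site, "0/0")] += 1
        # Require enough callable callsets and a single agreed genotype.
        if sum(votes.values()) >= min_support and len(votes) == 1:
            genotype = next(iter(votes))
            if genotype != "0/0":
                high_conf[site] = genotype
    return high_conf

# Sites are (chrom, pos, ref, alt) tuples; the data below is made up.
A = ("20", 1000, "A", "G")
B = ("20", 2000, "C", "T")
C = ("20", 3000, "G", "A")
callsets = {
    "illumina": {"calls": {A: "0/1", B: "1/1", C: "0/1"}, "callable": {A, B, C}},
    "ion": {"calls": {A: "0/1", B: "0/1"}, "callable": {A, B}},
    "cg": {"calls": {A: "0/1"}, "callable": {A, C}},
}
result = integrate(callsets)
print(result)
```

In this made-up data only site `A` survives: `B` has discordant genotypes, and `C` is called by one callset while another callable callset sees no variant. In the process described on the slides, such discordant sites would be arbitrated using evidence of bias rather than simply dropped.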
10. Criteria for including new callsets
• Form sensitive variant calls from each dataset
• Define “callable regions” for each callset
• Good coverage and MapQ
• Use knowledge about technology and manual inspection to exclude repetitive regions difficult for each dataset
• For new callsets, ensure most FNs in callable regions relative to current high-confidence calls are questionable in the current calls
• Filter calls from each method with annotations unlike concordant calls
– Annotations for which outliers are expected to indicate bias should be selected for each callset
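As a rough illustration of the "good coverage and MapQ" criterion, the sketch below derives callable intervals from per-position depth and mean mapping quality. The thresholds (`min_depth=20`, `min_mapq=30`) and the function name are assumptions chosen for the example; the slides do not give the actual cutoffs, and the real criteria also involve technology-specific knowledge and manual inspection.

```python
def callable_regions(depth, mapq, min_depth=20, min_mapq=30):
    """Return half-open [start, end) intervals where both thresholds are met.

    depth and mapq are per-position lists for one contig (0-based positions).
    """
    n = min(len(depth), len(mapq))
    regions = []
    start = None
    for pos in range(n):
        ok = depth[pos] >= min_depth and mapq[pos] >= min_mapq
        if ok and start is None:
            start = pos                    # open a new callable interval
        elif not ok and start is not None:
            regions.append((start, pos))   # close the current interval
            start = None
    if start is not None:
        regions.append((start, n))
    return regions

# Made-up per-position values for a 7 bp stretch.
depth = [5, 25, 30, 28, 10, 22, 24]
mapq = [60, 60, 60, 10, 60, 60, 60]
regions = callable_regions(depth, mapq)
print(regions)  # prints [(1, 3), (5, 7)]
```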
11. Ongoing work: With sufficient coverage, 10X phasing seems to specifically identify most SNP errors identified by pedigree phasing
Collaboration with Nathan Edwards and Zhezhen Wang at Georgetown University
12. Ongoing work: How can we add more complex events that are not normalized?
• Current integration only breaks into primitives
– Some complex calls end up uncertain
– If part of a complex variant is uncertain, we exclude the whole region
• 3 approaches
– Kevin Jacobs – vgraph
• Merge all callsets into a single graph
• Still need to work on partial complex calls
– Chen Sun and Paul Medvedev – varmatch
• Start with one callset and match other callers one at a time, adding in new variants from each
– Sean Irvine and Len Trigg, RTG – vcfeval
• Presentation today
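The underlying difficulty is that one complex event can be written as a single call or as several primitives, so comparing VCF records directly can miss equivalent calls. A minimal sketch of comparing the resulting haplotype sequences instead is shown below. It is not the algorithm used by vgraph, varmatch, or vcfeval; the helper names, the 0-based coordinates, and the toy reference are assumptions for illustration.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants (0-based pos) to a sequence."""
    out = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        out.append(reference[cursor:pos])  # unchanged sequence before the variant
        out.append(alt)
        cursor = pos + len(ref)
    out.append(reference[cursor:])
    return "".join(out)

def same_haplotype(reference, a, b):
    """True if two variant sets produce the same sequence on this reference."""
    return apply_variants(reference, a) == apply_variants(reference, b)

reference = "GATTACAGT"
complex_call = [(3, "TAC", "CAT")]            # one complex record
primitives = [(3, "T", "C"), (5, "C", "T")]   # the same event as two SNPs
partial = [(3, "T", "C")]                     # only part of the complex event

print(same_haplotype(reference, complex_call, primitives))  # prints True
print(same_haplotype(reference, complex_call, partial))     # prints False
```

The `partial` case mirrors the situation on the slide where only part of a complex variant is supported, which is why such regions are currently excluded as a whole.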
13. Ongoing work: GRCh38
• Draft calls for chr20 on GRCh38
• Make calls on mapped reads for Illumina and 10X
• Lift over calls for CG, Ion, and SOLiD
• Preliminary comparisons to PG seem similar to those for GRCh37
14. Ongoing/Future Work and Questions
• Integrate with pedigree calls for NA12878
– Mike Eberle, Illumina
• Integrating phasing information from family, linked reads, etc.
– Sean Irvine/Len Trigg, RTG
• Integrate complex variants
– Sean Irvine/Len Trigg, RTG
– Chen Sun/Paul Medvedev, PSU
• Incorporate more calls in difficult-to-map regions
– 10X
– Dovetail
– PacBio
• How to integrate indels 15-50bp?
• Using ALT loci
15. Acknowledgements
• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
– Liz Mansfield
– Zivana Tezak
– David Litwack
Editor's notes
There were lengthy discussions on the need for reference materials for validation and proficiency testing, including well-characterized patient samples, in silico data sets, and genetically engineered samples that have a range of variants. While several speakers noted that in silico datasets are helpful when there aren't sufficient patient samples, most maintained that there is no substitute for having real samples.
However, everyone agreed that there is a lack of funding for developing much-needed reference materials. "To characterize these reference materials is much more expensive than laboratory validation because you really need to sequence them with more than one sequencing technology," observed Deanna Church from Personalis. "You need a sophisticated adjudication mechanism for resolving differences and a lot of the analysis going on right now is being done by post docs and grad students."
Commercial entities such as Horizon Discovery are developing reference materials, but such resources add R&D cost, Putcha reflected. "The problem comes back to reimbursement … and what you actually get paid effectively for all of this R&D work," he said. "Realistically, it seems like the market might have to create the incentive to actually do this, but payors also have to acknowledge that this becomes part and parcel of how you get a test to the market and how you keep it available."
Elizabeth Mansfield, director of personalized medicine at FDA's Office of In Vitro Diagnostics and Radiological Health, recognized the funding gap for reference materials. "This is a rate-limiting step in the development of next-generation sequencing as a very strong clinical application," she said.