140127 GIAB update and NIST high-confidence calls
1. Genome in a Bottle Consortium
Progress Update
January 27, 2014
Justin Zook, Marc Salit, and the Genome in a
Bottle Consortium
2. Whole Genome RMs vs.
Current Validation Methods
• Sanger confirmation
– Limited by number of sites (and sometimes it’s wrong)
• High depth NGS confirmation
– May have same systematic errors
• Genotyping microarrays
– Limited to known (easier) variants
– Problems with neighboring “complex” variants, duplications
• Mendelian inheritance
– Can’t account for some systematic errors
• Simulated data
– Generally not very representative of errors in real data
• Ti/Tv
– Varies by region of genome, and only gives overall statistic
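To make the Ti/Tv point concrete, here is a minimal sketch of how the ratio is computed from a list of SNPs. The function name `ti_tv_ratio` and the `(ref, alt)` tuple input are assumptions for illustration, not part of any GIAB tool.

```python
# Transitions are A<->G (purines) or C<->T (pyrimidines); all other SNPs are transversions.
BASES = set("ACGT")
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) bases."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in BASES or alt not in BASES:
            continue  # skip anything that is not a simple SNP
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    # With no transversions the ratio is undefined; report infinity.
    return ti / tv if tv else float("inf")

example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ti_tv_ratio(example))  # 4 transitions / 2 transversions = 2.0
```

The result is a single number for the whole call set, which is why it cannot point to which individual calls are wrong.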
3. Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in
confident regions
• Include as much of the genome as possible in
the confident regions (i.e., don’t just take the
intersection)
• Avoid bias towards any particular platform
– take advantage of strengths of each platform
• Avoid bias towards any particular
bioinformatics algorithms
5. Integration of Data to
Form Highly Confident Genotype Calls
Candidate variants
Find all possible variant sites
Concordant variants
Find concordant sites across multiple datasets
Find characteristics
of bias
Identify sites with atypical characteristics signifying
sequencing, mapping, or alignment bias
Arbitrate using
evidence of bias
For each site, remove datasets with decreasingly atypical
characteristics until all datasets agree
Confidence Level
Even if all datasets agree, identify them as uncertain if
few have typical characteristics, or if they fall in known
segmental duplications, SVs, or long repeats
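The arbitration and confidence steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the NIST integration pipeline: the `Call` record, the single `atypicality` score, and the `typical_cutoff` and `min_typical` thresholds are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Call:
    dataset: str
    genotype: str       # e.g. "0/1"
    atypicality: float  # hypothetical score: higher = stronger evidence of bias

def arbitrate(calls, typical_cutoff=1.0, min_typical=2, in_difficult_region=False):
    """Drop the most atypical datasets until the rest agree, then grade confidence."""
    # Sort ascending so that pop() removes the most atypical dataset first.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    genotype = remaining[0].genotype if remaining else None
    # Even when all remaining datasets agree, require enough "typical" support
    # and exclude sites in known difficult regions (segdups, SVs, long repeats).
    n_typical = sum(c.atypicality < typical_cutoff for c in remaining)
    confident = (genotype is not None
                 and n_typical >= min_typical
                 and not in_difficult_region)
    return genotype, confident

site = [Call("A", "0/1", 0.2), Call("B", "0/1", 0.5), Call("C", "1/1", 3.0)]
print(arbitrate(site))                            # ('0/1', True)
print(arbitrate(site, in_difficult_region=True))  # ('0/1', False)
```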
6. Verification of “Highly Confident”
Genotype accuracy
• Sanger sequencing
– 100% accuracy but only 100s of sites
• X Prize Fosmid sequencing
– Sometimes call only part of a complex variant
• Microarrays
– Differences appear to be FP or FN in arrays
• Broad 250bp HaplotypeCaller
– Very highly concordant
• Platinum genomes pedigree SNPs
– Some systematic errors are inherited; different representations
of complex variants
• Real Time Genomics SNPs and indels
– Some interesting sites called by RTG complex caller
7. GCAT – Interactive Performance
Metrics
• NIST is working with
GCAT to use our highly
confident variant calls
• Assess performance of
many combinations of
mappers and variant
callers
• www.bioplanet.com/gcat
Improvement of FreeBayes over 1 year with indels
8. Why do calls differ from our highly
confident genotypes?
Apparent False Positives
• Platform-specific systematic
sequencing errors for SNPs
• Analysis-specific
• Difficult to map regions
• Indels in long
homopolymers
Apparent False Negatives
• Different complex variant
representation
• Near indels
• Inside repeats
9. Complex variants have multiple correct
unphased representations
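[Figure: alignments of the same complex variant by BWA, ssaha2, Novoalign, and CGTools against the reference (Ref); labels include "T insertion" and "TCTCT insertion"]
                              FP SNPs        FP MNPs       FP indels
Traditional comparison        0.38% (610)    100% (915)    6.5% (733)
Comparison with realignment   0.15% (249)    4.2% (38)     2.6% (298)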
• ~225,000 highly confident
variants are within 10bp of
another variant
• FPs and FNs are significantly
enriched for complex variants
• RTG vcfeval can fix this issue!
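A small sketch of why site-by-site comparison miscounts complex variants: two different variant representations can produce the same haplotype once applied to the reference. The toy reference and the `apply_variants` helper are assumptions for illustration; this is not how RTG vcfeval is implemented.

```python
def apply_variants(ref, variants):
    """Apply non-overlapping (pos, ref_allele, alt_allele) variants; pos is 0-based."""
    seq = ref
    # Apply from right to left so that earlier coordinates stay valid.
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"reference allele mismatch at position {pos}")
        seq = seq[:pos] + alt_allele + seq[pos + len(ref_allele):]
    return seq

REF = "GGCTCTCTAA"  # toy reference containing a CT repeat

# The same 2 bp insertion placed at either end of the repeat.
ins_left = [(1, "G", "GCT")]
ins_right = [(7, "T", "TCT")]

# One MNP versus two adjacent SNPs.
mnp = [(3, "TC", "AG")]
two_snps = [(3, "T", "A"), (4, "C", "G")]

print(apply_variants(REF, ins_left) == apply_variants(REF, ins_right))  # True
print(apply_variants(REF, mnp) == apply_variants(REF, two_snps))        # True
```

A comparison keyed on position and allele would count each pair as one false positive plus one false negative, even though the resulting sequences are identical.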
12. Structural variant analytical approach
• Depth of coverage (DOC)
– Control-FREEC
– CnD
• Paired-end mapping (PEM)
– Breakdancer
• Split read (SR)
– Pindel
• Assembly based (AS)
– Velvet
– ABySS
• Combination
– Genome-STRiP
– SVMerge
Output: list of structural variant calls
14. Validation parameters for each SV
• Coverage (mean and standard deviation)
• Paired-end distance/insert size (mean and
standard deviation)
• # of discordant paired-ends
• Soft clipping of the reads (mean and standard
deviation)
• Mapping quality (mean and standard deviation)
• # of heterozygous and homozygous SNP
genotype calls
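One way to gather the parameters listed above for a single candidate SV is sketched below. The input layout (a list of per-base depths, per-read dictionaries with `insert_size`, `soft_clipped_bases`, `mapq`, and `discordant` keys, and a list of SNP genotype strings) is hypothetical; a real pipeline would extract these from a BAM and a VCF.

```python
from statistics import mean, stdev

def summarize(values):
    """Return (mean, sample standard deviation) for a sequence of numbers."""
    values = list(values)
    if not values:
        return (float("nan"), float("nan"))
    return (mean(values), stdev(values) if len(values) > 1 else 0.0)

def split_genotype(gt):
    return gt.replace("|", "/").split("/")

def sv_validation_parameters(depths, reads, snp_genotypes):
    """Collect per-SV summary statistics from hypothetical pre-extracted inputs."""
    alleles = [split_genotype(g) for g in snp_genotypes]
    return {
        "coverage": summarize(depths),
        "insert_size": summarize(r["insert_size"] for r in reads),
        "n_discordant_reads": sum(1 for r in reads if r["discordant"]),
        "soft_clipping": summarize(r["soft_clipped_bases"] for r in reads),
        "mapping_quality": summarize(r["mapq"] for r in reads),
        # Heterozygous: two different alleles; homozygous: same non-reference allele.
        "n_het_snps": sum(1 for a in alleles if len(set(a)) > 1),
        "n_hom_snps": sum(1 for a in alleles if len(set(a)) == 1 and a[0] != "0"),
    }

reads = [
    {"insert_size": 300, "soft_clipped_bases": 0, "mapq": 60, "discordant": False},
    {"insert_size": 900, "soft_clipped_bases": 20, "mapq": 40, "discordant": True},
]
params = sv_validation_parameters([10, 20, 30], reads, ["0/1", "1/1", "0/0", "1|0"])
print(params["coverage"], params["n_discordant_reads"], params["n_het_snps"])
```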
15. Challenges with assessing
performance
• All variant types are not
equal
• All regions of the genome
are not equal
– Homopolymers, STRs, duplications
– Can be similar or different
in different genomes
• Labeling difficult variants
as uncertain leads to
higher apparent accuracy
when assessing
performance
• Genotypes fall in 3+
categories (not
positive/negative)
– standard diagnostic
accuracy measures not
well posed
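Two of these points can be shown with a toy example: genotype comparisons fall into a 3x3 table rather than a 2x2 one, and excluding sites labeled uncertain raises the apparent concordance. The genotype labels, the data, and the function names are made up for illustration.

```python
from collections import Counter

def genotype_confusion(truth, calls):
    """Count (truth, call) pairs; with 3 genotype classes this is a 3x3 table."""
    return Counter(zip(truth, calls))

def concordance(truth, calls, uncertain=frozenset()):
    """Fraction of sites where the call matches truth, skipping uncertain site indices."""
    kept = [(t, c) for i, (t, c) in enumerate(zip(truth, calls)) if i not in uncertain]
    return sum(t == c for t, c in kept) / len(kept)

truth = ["het", "hom_alt", "hom_ref", "het", "het",
         "hom_alt", "hom_ref", "het", "het", "hom_alt"]
calls = list(truth)
calls[8] = "hom_alt"  # miscall at a difficult site
calls[9] = "het"      # miscall at a difficult site

print(genotype_confusion(truth, calls)[("het", "hom_alt")])  # 1
print(concordance(truth, calls))                              # 0.8
print(concordance(truth, calls, uncertain={8, 9}))            # 1.0
```

The same call set scores 0.8 or 1.0 depending only on which sites are excluded, so a reported accuracy is only meaningful together with the regions it covers.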
16. Pedigree calls
• RTG and Illumina Platinum
Genomes working on this
• Sequence
NA12878, husband, and 11
children to identify high
confidence variants
– Identify cross-over events
– Determine if genotypes are
consistent with inheritance
• Should we integrate these
with the NIST high-confidence
genotypes?
• Should we find larger families
for future genomes?
• See afternoon presentations!
Source: Mike Eberle, Illumina
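The inheritance-consistency check mentioned above can be sketched for a single site as follows. This is a simplified illustration with a hypothetical function name and VCF-style genotype strings; it does not model cross-over events or phasing, and it is not the RTG or Illumina Platinum Genomes method.

```python
from itertools import product

def split_genotype(gt):
    return gt.replace("|", "/").split("/")

def is_mendelian_consistent(mother, father, child):
    """True if the child's unphased genotype takes one allele from each parent."""
    child_alleles = sorted(split_genotype(child))
    return any(sorted(pair) == child_alleles
               for pair in product(split_genotype(mother), split_genotype(father)))

print(is_mendelian_consistent("0/1", "0/0", "0/1"))  # True
print(is_mendelian_consistent("0/0", "0/0", "0/1"))  # False
print(is_mendelian_consistent("1/1", "0/0", "1/1"))  # False
```

A site can pass this check and still be wrong, since a systematic error that affects parents and children alike is inherited along with the true alleles.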
18. GIAB Characterization of pilot RM
• NIST – 300x 150x150bp HiSeq (from 6 vials)
• NIST – 100x 75bp ECC SOLiD 5500W
• Illumina – 50x 100x100bp HiSeq
• Complete Genomics – Normal and LFR (nonRM)
• Garvan Institute – Illumina exome
• NCI – Ion Proton whole genome
• INOVA – Infinium SNP/CNV array
19. Homogeneity and Stability
Homogeneity
• Multiplex First and last vial
– 3 libraries x 33x HiSeq each
• Multiplex 4 Random vials
– 2 libraries x 12.5x HiSeq each
• Compare variability due to:
– vial
– library
– day
– flow cell
– lane
– sampling
• Run PFGE on each vial for size
Stability
• Run PFGE to detect DNA
degradation
• Freeze-thaw 2 and 5 times
• Vortex for 10s
• 4°C for 2 and 8 weeks
• 37°C for 2 and 8 weeks
20. FTP site and Amazon S3
• NCBI is hosting fastq, bam, and vcf files on the
giab ftp site
• These data are mirrored to Amazon S3, so we
encourage you to take advantage of this!
21. Pilot Reference Material
• High-confidence calls are available on the ftp
site and are already being used
• NIST plans to release this as a NIST Reference
Material in the next couple months
22. Future Directions
• Characterize more
“difficult” regions/variants
• Structural variants
• Compare to pedigree calls
• Examine potentially
clinically relevant
regions/variants in RMs
• Use long-read technologies
– Moleculo
– CG LFR
– PacBio
– BioNano Genomics
– future technologies??
• Use glia/platypus to realign
reads to candidate variants
• Analyze interlaboratory
study data
• Characterize PGP genomes
– Ashkenazim trio
– son in Asian trio
– DNA at NIST in Jan-Feb 2014
– Volunteers to sequence?
• Select future genomes
• Tumor-normal?
23. Topic #1: Moving beyond the easy
regions/variants
Presentations
• Emerging Technologies
– PacBio
– Complete Genomics LFR
– Moleculo
– BioNano Genomics
• Structural Variants
– Bina Technologies
Topics
• Structural Variants
• Phasing
• Validation
• Where should we set the
threshold(s) for confidence?
24. Topic #2: Cancer and Future Genomes
Cancer
• Spike-ins
• Mixtures of normal cell lines
• Tumor-normal cell line pair
• Transcriptome controls
Priorities for Future Genomes
• Diverse ancestry groups
• Larger families
• Recruitment with consent
for commercialization
• How many genomes?
• Should the parents be NIST
Reference Materials, or only
the child?
25. Working Group Questions
RM Selection & Design
• Spike-in controls
• FFPE
• Commercial RMs
• ABRF interlaboratory study
• Should we prioritize one or
two genomes?
RM Characterization
• Production mode for new
trios
– Pilot was characterized by
Illumina, SOLiD, Ion
Proton, and Complete
Genomics
– What resources should we
invest in measurements for
each new family?
26. Working Group Questions
Bioinformatics
• Storing data/pipelines
– Suggestions for ftp structure
– Data submission/accessioning
process
– Data model for genomic data
– Archiving pipelines and
reproducible research
• GRCh38
• How to use pedigree calls for pilot
genome?
• Clones for targeted regions (hard
regions if not whole genome)
• In which difficult regions should
we focus our characterization?
Performance Metrics
• Target audience
• Requirements for user
interface
– Establishing truth set(s)
– Inputs/Outputs
– Visualization
• Integration with GeT-RM
Editor's Notes
Meeting Notes (5/28/13 17:05): ask Heng for decoy