This document outlines tasks for a working group to analyze and integrate genomic data sets for the individual NA12878. The goals are to:
1. Develop a strategy to analyze individual data sets and a plan to integrate the data and form consensus variant calls and confidence estimates.
2. Inventory existing NA12878 data sources and characterize them.
3. Define quality filters for reads, runs, and lanes based on existing large-scale studies.
4. Compile the first set of filtered NA12878 data from various sources by the end of September.
5. Run existing variant calling pipelines on the compiled data set using reference genomes.
2. WG Charge
• Develop strategy to analyze each data set
• Develop plan for integrating data and forming consensus variant calls and confidence estimates
• Develop consensus plan for data representation
• Building on NCBI-CDC efforts in GeT-RM Project
– developing repository
– developing browser
• shared work with Performance Metrics WG
– will be scaled for GiaB
• Building on NIST work for integration and confidence estimation
3. Task 1: Inventory existing NA12878 data
• Assignees: NIST & GeT-RM (NCBI)
• Timeline: Evolving document with version 1 by August 31
• Prepare a project document (e.g. Google doc) with matrix of all sources to include:
– Submitter details
– DNA source (cell line DNA / source collection DNA)
– Coverage characteristics
– Instrument platform
– Library design: fosmid, WGS, or WES, with ascertainment characteristics (fosmid) and target design (WES) where known
– Source of data on internet
– Data release / availability date for new data sets & platforms
– Priority rank for data consolidation, filtering, and analysis
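One way the inventory matrix could be kept machine-readable alongside the shared document is sketched below. This is only an illustration: the `SourceRecord` class, its field names, and the `to_csv` helper are assumptions made for the example, and the priority ranks in the sample rows are invented placeholders rather than decisions of the group.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class SourceRecord:
    """One row of the inventory matrix (field names are hypothetical)."""
    submitter: str
    dna_source: str      # "cell line" or "source collection"
    coverage: str        # free text, e.g. "60x"
    platform: str
    library_design: str  # "fosmid", "WGS", or "WES"
    url: str             # where the data live; left empty if not yet known
    release_date: str    # availability date; left empty if not yet known
    priority: int        # 1 = consolidate first (placeholder values below)


def to_csv(records):
    """Serialize records to CSV text, ordered by priority rank."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(SourceRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for rec in sorted(records, key=lambda r: r.priority):
        writer.writerow(asdict(rec))
    return buf.getvalue()


# Sample rows; coverage values come from the source notes, ranks are invented.
records = [
    SourceRecord("Illumina", "", "200x", "Illumina", "WGS", "", "", 2),
    SourceRecord("Broad", "", "60x", "HiSeq", "WGS", "", "", 1),
]
print(to_csv(records))
```

A flat CSV keeps the matrix easy to paste into a shared spreadsheet while still letting scripts sort by priority when data are staged.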
4. Task 1: NA12878 source discussion
Source: Notes
GeT-RM: Several available sources have been inventoried.
Broad: Many lab protocols & designs; NA12878 HiSeq 60x and WES 100x. To check whether an assembly or integrated analysis is available. BED available for WES design, but GATK reverse engineers the target set.
SRA trace: Chromatograms for Eichler's fosmids. Made from DNA (not cell line), assembled at WashU, and submitted to NCBI. Debate about which to use. Some fosmids were picked because they are tricky. G1K should have a usable set available in 1 month. Note that the set may have 1-2% contamination from other fosmid libraries.
Xprize: Fosmids covering 3-5% of genome and whole genome at 30x. Not all fosmids are done. NA19239 available at GenomeSpace and NA12878 coming in 2 months.
Illumina: Platinum reads at ENA, 200x.
CG: 50-60x, on CG FTP site.
OpGen: Check if data are publicly available.
5. Task 2: define quality for reads, runs, lanes
• Assignees: Real Time Genomics and NCBI
• Timeline: Proposal for comment by August 31.
• The consortium should define a protocol for quality filters on reads, runs, and lanes.
• A prototype can be taken from 1000 Genomes and other current large-scale studies.
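As a rough picture of what a read-level and lane-level filter might look like, here is a minimal sketch. The thresholds (`min_mean_q`, `max_n_fraction`) and all function names are assumptions made for illustration; the actual protocol is still to be proposed, and nothing here is taken from the 1000 Genomes filters.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(c) - offset for c in quality) / len(quality)


def passes_read_filter(sequence, quality, min_mean_q=20.0, max_n_fraction=0.05):
    """Return True if a read meets the (hypothetical) quality thresholds."""
    if not sequence or len(sequence) != len(quality):
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    return mean_phred(quality) >= min_mean_q and n_fraction <= max_n_fraction


def lane_pass_rate(reads, **thresholds):
    """Fraction of (sequence, quality) pairs in a lane that pass the read filter."""
    reads = list(reads)
    if not reads:
        return 0.0
    passed = sum(passes_read_filter(s, q, **thresholds) for s, q in reads)
    return passed / len(reads)


# Toy lane: one high-quality read and one low-quality read.
lane = [("ACGT", "IIII"), ("ACGT", "####")]
print(lane_pass_rate(lane))
```

A lane- or run-level rule could then be expressed as a minimum pass rate, so that the same read filter drives all three levels.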
6. Task 3: compile data
• Assignees: NIST and NCBI
• Timeline: First set filtered by end of Sept.
• Iterative process to follow matrix priority
• Identify data hosting sites: Google, AWS, NCBI, EBI?
• Hosted data would provide centralized and synchronized stores for filtered reads to use in reanalysis.
• Separate areas clearly labeled as “working” and “released”.
• All areas publicly accessible.
• Results from pipelines would be posted for group analysis.
7. Task 4: run pipelines
• Volunteers to run existing pipelines:
• Real Time Genomics (Illumina & CG)
• NCBI (Illumina, SOLiD, and CG)
• Edge Genomics (Illumina and SOLiD)
• Timeline: As data are staged and callers installed
• References: GRCh37 with baits (G1K standard) no chr Y.
• GRCh38 to assess effects of alt haplotypes / fixes
• Run modules for all mutation types including mobile elements:
• FreeBayes, GATK2, CG.
• lobSTR for LINEs, Alus, CA repeats.
• Hydra, Pindel, proprietary for structural variation.
8. Task 5: consensus call integration.
• Recalibration: All or subset of dbSNP, e.g. G1K plus GO-ESP.
• Analysis to produce
– BAM, and archive compressed versions in cSRA & CRAM
– VCF – or –
– Novel compressed genome format: variants and probability that a region matches the reference, i.e. confident nothing but reference in region. [gVCF?]
– Quantile the genome by difficulty of aligning reads / calling variants
– Should analysis include WES on/off target specificity?
• Single file per individual.
• Future references may include tissue-specific DNA/RNA samples
• Think about epigenomic markup to future-proof resource
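To make the idea of consensus calls with confidence estimates concrete, the sketch below uses a naive majority vote across call sets. This is a deliberately simple stand-in and not the integration method the group plans to adopt; the function name, the `min_support` parameter, and the use of the supporting fraction as a "confidence" value are all assumptions for illustration.

```python
from collections import Counter


def consensus_calls(call_sets, min_support=2):
    """Naive consensus over call sets.

    call_sets maps a pipeline name to an iterable of variants, each a
    (chrom, pos, ref, alt) tuple. A variant is kept if at least
    min_support pipelines report it; the returned value is the fraction
    of pipelines that support it.
    """
    counts = Counter()
    for calls in call_sets.values():
        counts.update(set(calls))  # count each pipeline at most once per variant
    n = len(call_sets)
    return {
        variant: count / n
        for variant, count in counts.items()
        if count >= min_support
    }


# Toy example with three hypothetical pipelines.
call_sets = {
    "pipeline_a": [("1", 100, "A", "G"), ("1", 200, "C", "T")],
    "pipeline_b": [("1", 100, "A", "G")],
    "pipeline_c": [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 5, "G", "A")],
}
for variant, support in sorted(consensus_calls(call_sets).items()):
    print(variant, round(support, 2))
```

A real integration would also have to normalize variant representation across callers and weigh platform-specific error profiles, which a simple vote ignores.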
9. Working as a consortium
• Analysis group will need a listserv and periodic discussions via standing conference calls.
• Google Doc area
• Data staging areas
• Further group tasks to be discussed via blogs, telecons, and mailing lists
10. Archival of Data & Pipelines
• Ongoing discussions
– Cloud data availability
– Data formats
• Pipeline archival
– startup commercial services in this space
• robustness?
– Amazon
– Google
– Federal Resources?
11. Data Representation/Data Standards
• CDC taking lead in convening standards proposal
– focusing on VCF
• gVCF
– assembling working team from stakeholder communities
– workshops?
– telecons?
• Alternate approaches
– representation as assemblies
• not variant calls