March 2013 Bioinformatics Working Group

Bioinformatics, Data Integration,
and Data Representation
Steve Sherry, NCBI WG Lead
Justin Zook, Presenter

WG Charge
• Develop strategy to analyze each data set
• Develop plan for integrating data and forming consensus variant calls and confidence estimates
• Develop consensus plan for data representation

• Building on NCBI-CDC efforts in GeT-RM Project
– developing repository
– developing browser
• shared work with Performance Metrics WG
– will be scaled for GiaB
• Building on NIST work for integration and confidence estimation

Task 1: Inventory existing NA12878 data
• Assignees: NIST & GeT-RM (NCBI)
• Timeline: Evolving document with version 1 by August 31

• Prepare a project document (e.g. Google doc) with matrix of all sources to include:
– Submitter details
– DNA source (cell line DNA / source collection DNA)
– Coverage characteristics
– Instrument platform
– Library design – fosmid, WGS, WES with ascertainment characteristics (fosmid) and target
design (WES) where known
– Source of data on internet
– Data release / availability date for new data sets & platforms
– Priority rank for data consolidation, filtering, and analysis
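
One possible way to hold the matrix rows in code, so that the priority rank can drive the order of later consolidation, is sketched below. This is an illustration only: the class name `DataSource`, its field names, the helper `by_priority`, and the placeholder rows are all assumptions and do not come from the working group document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataSource:
    """One row of the inventory matrix. All field names are hypothetical."""
    submitter: str
    dna_source: str              # "cell line" or "source collection"
    coverage: str                # free text, e.g. "60x"
    platform: str
    library_design: str          # "fosmid", "WGS", or "WES"
    url: Optional[str]           # source of data on the internet, if known
    release_date: Optional[str]  # release / availability date, if known
    priority: int                # 1 = consolidate first (assumed convention)


def by_priority(sources: List[DataSource]) -> List[DataSource]:
    """Return rows ordered by priority rank, lowest number first."""
    return sorted(sources, key=lambda s: s.priority)


# Placeholder rows; the values are invented for the example.
inventory = [
    DataSource("Submitter B", "cell line", "30x", "Platform Y", "WGS", None, None, 2),
    DataSource("Submitter A", "source collection", "100x", "Platform X", "WES", None, None, 1),
]
ordered = by_priority(inventory)
```

Keeping the rank as an explicit field means the same list can be re-sorted as the document evolves.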

Task 1: NA12878 source discussion
Sources and notes:

• GeT-RM: Several available sources have been inventoried.

• Broad: Many lab protocols and designs; NA12878 HiSeq 60x and WES 100x. To check whether an assembly or integrated analysis is available. A BED file is available for the WES design, but GATK reverse-engineers the target set.

• SRA trace: Chromatograms for Eichler’s fosmids, made from DNA (not cell line), assembled at WashU, and submitted to NCBI. There is debate about which to use; some fosmids were picked because they are tricky. G1K should have a usable set available in 1 month. Note that the set may have 1-2% contamination from other fosmid libraries.

• Xprize: Fosmids covering 3-5% of the genome, plus whole genome at 30x. Not all fosmids are done. NA19239 is available at GenomeSpace; NA12878 is coming in 2 months.

• Illumina: Platinum reads at ENA, 200x.

• CG: 50-60x, on the CG FTP site.

• OpGen: Check whether data are publicly available.

Task 2: define quality for reads, runs, lanes
• Assignees: Real Time Genomics and NCBI
• Timeline: Proposal for comment by August 31.

• The consortium should define a protocol for
quality filters on reads, runs and lanes.

• A prototype can be taken from 1000 genomes
and other current large scale studies.
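
As a rough illustration of what a read-level and lane-level filter could look like, the sketch below decodes Phred+33 quality strings and applies a mean-quality and length cutoff. The function names, the thresholds (mean Q20, 30 bases), and the idea of summarizing a lane by its pass fraction are assumptions made for the example; the consortium protocol itself was still to be proposed.

```python
from typing import Iterable, List


def phred_scores(quality_string: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string (Phred+33 by default) to integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def passes_read_filter(quality_string: str,
                       min_mean_q: float = 20.0,
                       min_length: int = 30) -> bool:
    """Keep a read if it is long enough and its mean base quality is high enough.

    Both thresholds are placeholder values, not agreed consortium settings.
    """
    scores = phred_scores(quality_string)
    if len(scores) < min_length:
        return False
    return sum(scores) / len(scores) >= min_mean_q


def lane_pass_fraction(quality_strings: Iterable[str]) -> float:
    """Fraction of reads in a lane that pass the read filter (0.0 if empty)."""
    total = 0
    passed = 0
    for q in quality_strings:
        total += 1
        if passes_read_filter(q):
            passed += 1
    return passed / total if total else 0.0
```

A lane or run could then be flagged when its pass fraction falls below a cutoff chosen by the group.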

Task 3: compile data
• Assignees: NIST and NCBI
• Timeline: First set filtered by end of Sept.
• Iterative process to follow matrix priority

• Identify data hosting sites: Google, AWS, NCBI, EBI?

• Hosted data would provide centralized and synchronized stores for
filtered reads to use in reanalysis.

• Separate areas clearly labeled as “working” and “released”.

• All areas publicly accessible.

• Results from pipelines would be posted for group analysis.

Task 4: run pipelines
• Volunteers to run existing pipelines:
• Real time Genomics (Illumina & CG)
• NCBI (Illumina, SOLiD, and CG)
• Edge Genomics (Illumina and SOLiD)

• Timeline: As data are staged and callers installed

• References: GRCh37 with baits (G1K standard) no chr Y.
• GRCh38 to assess effects of alt haplotypes / fixes

• Run modules for all mutation types including mobile elements:
• FreeBayes, GATK2, CG.
• lobSTR for LINEs, Alus, CA repeats
• Hydra, Pindel, proprietary tools for structural variation.

Task 5: consensus call integration
• Recalibration: All or subset of dbSNP, e.g. G1K plus GoESP.
• Analysis to produce
– BAM, and archive compressed versions in cSRA & CRAM
– VCF – or –
– Novel compressed genome format: variants and probability that a
region matches the reference, i.e. confident nothing but reference in
region. [gVCF?]
– Quantile the genome by difficulty to align / call variants
– Should analysis include WES on/off target specificity?
• Single file per individual.
• Future references may include tissue-specific DNA/RNA samples
• Think about epigenomic markup to future-proof resource
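
The slides do not specify how consensus calls and confidence estimates would be formed. Purely as an illustration of the idea, the sketch below takes a majority vote across data sets at each site and reports the agreeing fraction as a crude confidence value. The names `consensus_calls`, `Site`, and `CallSet` are hypothetical, and counting a missing call as homozygous reference is a simplifying assumption: absence of a call is not real evidence for the reference, which is the gap the reference-confidence (gVCF-style) format above is meant to address.

```python
from collections import Counter
from typing import Dict, Tuple

Site = Tuple[str, int]      # (chromosome, 1-based position)
CallSet = Dict[Site, str]   # site -> genotype string, e.g. "0/1"


def consensus_calls(call_sets: Dict[str, CallSet]) -> Dict[Site, Tuple[str, float]]:
    """Majority genotype per site, with the fraction of data sets that agree.

    ASSUMPTION: a data set with no call at a site votes "0/0".
    Ties are broken by whichever genotype was seen first.
    """
    n = len(call_sets)
    sites = set()
    for calls in call_sets.values():
        sites.update(calls)

    result: Dict[Site, Tuple[str, float]] = {}
    for site in sorted(sites):
        votes = Counter(calls.get(site, "0/0") for calls in call_sets.values())
        genotype, count = votes.most_common(1)[0]
        result[site] = (genotype, count / n)
    return result


# Invented example: three call sets for the same sample.
example = {
    "set_a": {("1", 100): "0/1", ("1", 200): "1/1"},
    "set_b": {("1", 100): "0/1"},
    "set_c": {("1", 100): "0/1", ("1", 200): "1/1"},
}
merged = consensus_calls(example)
```

A real integration would weight data sets by platform-specific error characteristics rather than treating each vote equally.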

Working as a consortium
• Analysis group will need a listserv and periodic
discussions via standing conference calls.

• Google Doc area

• Data staging areas

• Further group tasks to be discussed via blogs,
telecons and mailing lists

Archival of Data & Pipelines
• Ongoing discussions
– Cloud data availability
– Data formats
• Pipeline archival
– startup commercial services in this space
• robustness?
– Amazon
– Google
– Federal Resources?

Data Representation/Data Standards
• CDC taking lead in convening standards proposal
– focusing on VCF
• gVCF
– assembling working team from stakeholder communities
– workshops?
– telecons?

• Alternate approaches
– representation as assemblies
• not variant calls
