Bioinformatics, Data Integration,
    and Data Representation
     Steve Sherry, NCBI WG Lead
        Justin Zook, Presenter
WG Charge
• Develop strategy to analyze each data set
• Develop plan for integrating data and forming consensus variant calls and
  confidence estimates
• Develop consensus plan for data representation
• Building on NCBI-CDC efforts in GeT-RM Project
  – developing repository
  – developing browser
     • shared work with Performance Metrics WG
  – will be scaled for GiaB
• Building on NIST work for integration and confidence estimation
Task 1: Inventory existing NA12878 data
•   Assignees:   NIST & GeT-RM (NCBI)
•   Timeline: Evolving document with version 1 by August 31

•   Prepare a project document (e.g. Google doc) with matrix of all sources to include:
     – Submitter details
     – DNA source (cell line DNA / source collection DNA)
     – Coverage characteristics
     – Instrument platform
     – Library design – fosmid, WGS, WES with ascertainment characteristics (fosmid) and target
        design (WES) where known
     – Source of data on internet
     – Data release / availability date for new data sets & platforms
     – Priority rank for data consolidation, filtering, and analysis
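One way the inventory matrix could be held in a machine-readable form is sketched below. This is only an illustration under stated assumptions: the class name `DataSource`, its field names, and the example rows are hypothetical placeholders, not entries from the actual project document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataSource:
    """One row of the inventory matrix (field names are hypothetical)."""
    submitter: str
    dna_source: str              # "cell line" or "source collection"
    coverage: str                # e.g. "30x"
    platform: str                # instrument platform
    library_design: str          # "fosmid", "WGS", or "WES"
    url: Optional[str]           # where the data live on the internet, if known
    release_date: Optional[str]  # availability date for new data sets
    priority: int                # 1 = consolidate, filter, and analyze first


def by_priority(sources: List[DataSource]) -> List[DataSource]:
    """Return the sources in the order they should be processed."""
    return sorted(sources, key=lambda s: s.priority)


# Placeholder rows for illustration only; not real submissions.
inventory = [
    DataSource("Example Lab B", "cell line", "30x", "Platform Y", "WGS", None, None, 2),
    DataSource("Example Lab A", "cell line", "100x", "Platform X", "WES", None, None, 1),
]

for source in by_priority(inventory):
    print(source.priority, source.submitter, source.library_design, source.coverage)
```

Keeping the priority rank as a sortable field lets the later compile step iterate over sources in matrix order.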
Task 1: NA12878 source discussion
Source      Notes

GeT-RM      Several available sources have been inventoried.

Broad       Many lab protocols and designs; NA12878 HiSeq 60x and WES 100x. To check whether an
            assembly or integrated analysis is available. A BED file is available for the WES
            design, but GATK reverse-engineers the target set.

SRA trace   Chromatograms for Eichler's fosmids. Made from DNA (not cell line), assembled at
            WashU, and submitted to NCBI. Debate about which to use. Some fosmids were picked
            because they are tricky. G1K should have a usable set available in 1 month. Note
            that the set may have 1-2% contamination from other fosmid libraries.

X Prize     Fosmids covering 3-5% of the genome, and whole genome at 30x. Not all fosmids are
            done. NA19239 is available at GenomeSpace; NA12878 is coming in 2 months.

Illumina    Platinum reads at ENA, 200x.

CG          50-60x, on the CG FTP site.

OpGen       Check whether data are publicly available.
Task 2: Define quality for reads, runs, and lanes
• Assignees: Real Time Genomics and NCBI
• Timeline: Proposal for comment by August 31.

• The consortium should define a protocol for
  quality filters on reads, runs and lanes.

• A prototype can be taken from 1000 Genomes
  and other current large-scale studies.
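As a rough sketch of what a read-level and lane-level quality filter might look like, the example below scores FASTQ quality strings. It assumes Phred+33 encoding, and the thresholds (`min_mean_q`, `min_length`) and function names are illustrative assumptions; they are not values the consortium has agreed on.

```python
from typing import Iterable


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def read_passes(quality: str, min_mean_q: float = 20.0, min_length: int = 30) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    return len(quality) >= min_length and mean_phred(quality) >= min_mean_q


def lane_pass_rate(qualities: Iterable[str]) -> float:
    """Fraction of reads in a lane (or run) that pass the read-level filter."""
    quals = list(qualities)
    if not quals:
        return 0.0
    return sum(read_passes(q) for q in quals) / len(quals)


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
lane = ["I" * 50, "#" * 50]
print(lane_pass_rate(lane))
```

A lane or run could then be flagged when its pass rate falls below whatever cutoff the protocol defines.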
Task 3: Compile data
• Assignees: NIST and NCBI
• Timeline: First set filtered by end of Sept.
   – Iterative process to follow matrix priority

• Identify data hosting sites: Google, AWS, NCBI, EBI?

• Hosted data would provide centralized and synchronized stores for
  filtered reads to use in reanalysis.

• Separate areas clearly labeled as “working” and “released”.

• All areas publicly accessible.

• Results from pipelines would be posted for group analysis.
Task 4: Run pipelines
• Volunteers to run existing pipelines:
   – Real Time Genomics (Illumina & CG)
   – NCBI (Illumina, SOLiD, and CG)
   – Edge Genomics (Illumina and SOLiD)

• Timeline: As data are staged and callers installed

• References: GRCh37 with baits (G1K standard), no chr Y.
   – GRCh38 to assess effects of alt haplotypes / fixes

• Run modules for all mutation types including mobile elements:
   – FreeBayes, GATK2, CG.
   – Lobster for LINEs, Alus, CA repeats
   – Hydra, Pindel, proprietary for structural variation.
Task 5: Consensus call integration
• Recalibration: All or subset of dbSNP, e.g. G1K plus GO-ESP.
• Analysis to produce
   – BAM, and archive compressed versions in cSRA & CRAM
   – VCF – or –
   – Novel compressed genome format: variants and probability that a
     region matches the reference, i.e. confident nothing but reference in
     region. [gVCF?]
   – Quantile the genome by difficulty to align / call variants
   – Should analysis include WES on/off target specificity?
• Single file per individual.
• Future references may include tissue-specific DNA/RNA samples
• Think about epigenomic markup to future-proof resource
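To make the idea of consensus calls with confidence estimates concrete, here is a minimal sketch that takes a majority vote across pipelines at each site. It is not the NIST integration method. The function name, the `min_agreement` threshold, and the rule that a pipeline with no call at a site counts as homozygous reference ("0/0") are all assumptions made for illustration. That last rule is only safe where the pipeline is confident the region matches the reference, which is the information the proposed gVCF-style format would carry.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Site = Tuple[str, int]    # (chromosome, 1-based position)
Calls = Dict[Site, str]   # site -> genotype string such as "0/1"


def consensus(
    callsets: Dict[str, Calls], min_agreement: float = 0.5
) -> Dict[Site, Tuple[Optional[str], float]]:
    """Majority-vote genotype per site plus the fraction of pipelines agreeing.

    The genotype is None when no genotype exceeds min_agreement.
    """
    sites = set()
    for calls in callsets.values():
        sites.update(calls)

    n_pipelines = len(callsets)
    result: Dict[Site, Tuple[Optional[str], float]] = {}
    for site in sorted(sites):
        # Assumption: a missing call is treated as homozygous reference.
        votes = Counter(calls.get(site, "0/0") for calls in callsets.values())
        genotype, count = votes.most_common(1)[0]
        agreement = count / n_pipelines
        result[site] = (genotype if agreement > min_agreement else None, agreement)
    return result


# Hypothetical pipeline outputs for illustration only.
callsets = {
    "pipeline_a": {("1", 100): "0/1", ("1", 200): "1/1"},
    "pipeline_b": {("1", 100): "0/1"},
    "pipeline_c": {("1", 100): "0/1", ("1", 200): "0/1"},
}
for site, (genotype, agreement) in consensus(callsets).items():
    print(site, genotype, round(agreement, 2))
```

Sites where the pipelines disagree come back with no genotype, so they can be set aside for closer review instead of being forced into a call.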
Working as a consortium
• Analysis group will need a listserv and periodic
  discussions via standing conference calls.

• Google Doc area

• Data staging areas

• Further group tasks to be discussed via blogs,
  telecons, and mailing lists
Archival of Data & Pipelines
• Ongoing discussions
   – Cloud data availability
   – Data formats
• Pipeline archival
   – startup commercial services in this space
      • robustness?
   – Amazon
   – Google
   – Federal Resources?
Data Representation/Data Standards
• CDC taking lead in convening standards proposal
  – focusing on VCF
     • gVCF
  – assembling working team from stakeholder communities
  – workshops?
  – telecons?
• Alternate approaches
  – representation as assemblies
     • not variant calls
NCBI GeT-RM Genome Browser

March 2013 Bioinformatics Working Group
