1000G/UK10K: Bioinformatics, storage, and
   compute challenges of large scale resequencing


   Thomas Keane,
   Vertebrate Resequencing Informatics,
   Wellcome Trust Sanger Institute,
   Cambridge,
   UK
   E: tk2@sanger.ac.uk




Vertebrate Resequencing Informatics   8th December, 2010
1000G Update


  Total Number of Base Pairs                               23,416GB
  Aligned Base Pairs                                       13,527GB
  Number of Samples                                        1103
  Samples with > 10GB raw sequence                         1078
  Samples with > 10GB aligned sequence                     718

                                                                      Laura Clarke

1000G update – Raw Sequence Growth

[Figure: chart of raw sequence growth over time, one series per population: CEU, YRI, JPT, TSI, CHB, ASW, LWK, MXL, GBR, CHS, FIN, PUR, CLM, IBS. Vertical axis runs from 0 to 25000; horizontal axis is date.]

                                                                      Laura Clarke



UK10K

 Large scale population/medical based sequencing project
 UK10K project recently funded by WT
      4,000 cohort samples genome wide @ 6x
             Deeply phenotyped TwinsUK and ALSPAC cohorts
      6,000 exomes from extreme samples
             Protein coding exons from GenCode
             Extreme end of traits of medical interest, and from collections of familial
              cases
             Accumulation of rare variants within genes or pathways
      Utilise computational methods, data formats and workflows developed
       during 1000 genomes project
      Data release via EGA under access control
      Estimating 100Tbp of raw sequence data
      http://www.uk10k.org
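
As a rough sanity check on the 100Tbp estimate above, the whole-genome share can be worked out from the sample count and depth. This is only a sketch: the genome size of 3.1 Gbp is an assumption that does not appear in the slides, and the amount attributed to exomes is simply the remainder, not a stated figure.

```python
# Back-of-envelope check of the UK10K raw sequence estimate.
# GENOME_SIZE_BP is an assumed value (approximate human genome size);
# the sample counts, depth and 100 Tbp total come from the slide.
GENOME_SIZE_BP = 3.1e9      # assumption, not from the slides
COHORT_SAMPLES = 4000
COHORT_DEPTH = 6
EXOME_SAMPLES = 6000
TOTAL_ESTIMATE_BP = 100e12  # 100 Tbp


def whole_genome_bp(samples, depth, genome_size_bp=GENOME_SIZE_BP):
    """Raw base pairs needed to sequence `samples` genomes at `depth`x."""
    return samples * depth * genome_size_bp


cohort_bp = whole_genome_bp(COHORT_SAMPLES, COHORT_DEPTH)
remainder_bp = TOTAL_ESTIMATE_BP - cohort_bp
per_exome_bp = remainder_bp / EXOME_SAMPLES

print(f"Cohort genomes: {cohort_bp / 1e12:.1f} Tbp")
print(f"Left for exomes: {remainder_bp / 1e12:.1f} Tbp")
print(f"Implied raw sequence per exome: {per_exome_bp / 1e9:.1f} Gbp")
```

Under that assumption the 4,000 low-coverage genomes account for roughly three quarters of the total.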




1000G BAM File Evolutions

 BAM
      Until now BAMs included all raw data
      Recently, tags have been removed
             OQ: original qualities
             Non-standard tags: XM, XG, XO
      Also added BAQ differences to indicate non-confidently aligned bases
      Space saving of ~30%
             E.g. NA19625: 1.45 vs 0.98 bytes per bp
             Primary gain is from removal of original qualities
 Further proposals
      Replace base calls with '=' sign to indicate agreement with reference
      Rejected due to lack of tool support
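
The two ideas on this slide can be illustrated with a short sketch. The first function recomputes the relative saving from the quoted bytes-per-bp figures. The second shows what replacing matching base calls with '=' would look like. `mask_matching_bases` is a hypothetical helper, not part of any tool, and it assumes an ungapped alignment where read and reference segments have equal length.

```python
# Figures from the slide (NA19625, bytes per base pair).
BYTES_PER_BP_BEFORE = 1.45
BYTES_PER_BP_AFTER = 0.98


def relative_saving(before, after):
    """Fraction of space saved going from `before` to `after`."""
    return (before - after) / before


def mask_matching_bases(read_seq, ref_seq):
    """Replace read bases that equal the reference with '='.

    Hypothetical helper. Assumes an ungapped alignment: both strings
    have equal length and position i of the read aligns to position i
    of the reference segment.
    """
    if len(read_seq) != len(ref_seq):
        raise ValueError("read and reference segment must be the same length")
    pairs = zip(read_seq.upper(), ref_seq.upper())
    return "".join("=" if r == g else r for r, g in pairs)


print(f"Saving: {relative_saving(BYTES_PER_BP_BEFORE, BYTES_PER_BP_AFTER):.0%}")
print(mask_matching_bases("ACGTTAGC", "ACGATAGC"))
```

Long runs of '=' compress well, which is the attraction of the proposal; the cost is that every downstream tool must consult the reference to recover the bases.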




Population/Transposed BAM

 Traditionally BAM files have been produced per sample with all of the
 lanes/libraries merged
      Lanes -> Library -> Platform -> Sample (1 per individual)
 Problem: population based SNP calling needs to be aware of the
 reads across multiple samples at the same loci
      Opening hundreds/thousands of file handles simultaneously
       causes problems
      Distributed/parallel file systems prefer reading a few large striped files
 Solution: Transposed BAMs
      Genome slices with multiple samples within single BAM
             E.g. entire CEU population
      Header information to separate read groups into samples
             Samtools mpileup, GATK etc support this functionality
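
The header-based separation of read groups into samples can be sketched as follows. This is a minimal illustration that parses SAM-style header text directly, relying on the `ID` and `SM` tags of `@RG` lines. The example header and its read group IDs are made up for illustration.

```python
from collections import defaultdict


def read_groups_by_sample(sam_header_text):
    """Map sample name (SM) -> list of read group IDs (ID) from @RG lines."""
    groups = defaultdict(list)
    for line in sam_header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "ID" in fields and "SM" in fields:
            groups[fields["SM"]].append(fields["ID"])
    return dict(groups)


# Hypothetical header for illustration: the read group IDs are made up.
example_header = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@RG\tID:lane1\tSM:NA19294\tLB:lib1\tPL:ILLUMINA",
    "@RG\tID:lane2\tSM:NA19294\tLB:lib1\tPL:ILLUMINA",
    "@RG\tID:lane3\tSM:NA18943\tLB:lib2\tPL:ILLUMINA",
])
print(read_groups_by_sample(example_header))
```

A caller reading a multi-sample BAM would use such a mapping to assign each read, via its RG tag, back to an individual.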




Horizontal/Transposed BAM
                   Transposed BAMs

NA19294                                Chr1                            Chr2   ……..
NA18943                                Chr1                            Chr2   ……..
NA19305                                Chr1                            Chr2   ……..
   .
   .
   .



  Key questions
       Slice size – chromosome? 1Mbp, 10Mbp or 100Mbp?
       Size of individual groupings – 10, 50, 100, 500 individuals?
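
One way to explore these questions is to count how many files each combination would produce. The sketch below assumes one BAM per (genome slice, sample group) pair, a genome of about 3.1 Gbp cut into equal slices that ignore chromosome boundaries, and the 1103 samples from the 1000G update slide. All three are simplifying assumptions, not project decisions.

```python
import math

GENOME_SIZE_BP = 3_100_000_000   # assumed approximate human genome size
NUM_SAMPLES = 1103               # sample count from the 1000G update slide


def transposed_bam_count(slice_bp, group_size,
                         genome_bp=GENOME_SIZE_BP, samples=NUM_SAMPLES):
    """Number of files if every (genome slice, sample group) pair is one BAM."""
    slices = math.ceil(genome_bp / slice_bp)
    groups = math.ceil(samples / group_size)
    return slices * groups


for slice_mbp in (1, 10, 100):
    for group_size in (10, 50, 100, 500):
        n = transposed_bam_count(slice_mbp * 1_000_000, group_size)
        print(f"{slice_mbp:>3} Mbp slices, groups of {group_size:>3}: {n} files")
```

Small slices with small groups give hundreds of thousands of files, while large slices with large groups give a few dozen very large ones.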


VCF Format

 Fully adopted by 1000G group as interchange format for variant calls
        SNPs, indels, and recently SVs
        Genotyping calls for all samples
        Annotation of variants via user-defined tags
        VCF APIs and tools via http://vcftools.sourceforge.net
        Scaling issues with VCF – BCF format in development




                                                                  Petr Danecek
VCF (useful) Bloat

 Every release of 1000G adds more tags to VCF files
        ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
        ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
        ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
        ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
        ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
        ##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
        ##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
        ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
        ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
        ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
        ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
        ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">

 UK10K propose rich annotation of VCF files
      Known SNPs/indels
              RS IDs, G1K unaccessioned SNPs
      Geographical information
              Ensembl annotation (coding, exonic, intronic, UTR, splice..)
              microRNA, eQTL, known disease loci
      Coding consequences
              Synonymous/non-synonymous, splice, stop, GERP score
      Functional interpretation
              Polyphen, Sift, PANTHER
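
Because every release adds tags, downstream code usually needs to discover which tags a file declares. The sketch below parses `##INFO` lines of the form shown above into a dictionary. It assumes exactly the four fields in the order used on this slide (ID, Number, Type, Description) and is not a full VCF header parser.

```python
import re

# Matches only the four-field layout shown on the slide.
INFO_RE = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<desc>.*)">$'
)


def parse_info_headers(lines):
    """Return {tag ID: {'number', 'type', 'desc'}} for well-formed ##INFO lines."""
    tags = {}
    for line in lines:
        m = INFO_RE.match(line.strip())
        if m:
            tags[m.group("id")] = {
                "number": m.group("number"),
                "type": m.group("type"),
                "desc": m.group("desc"),
            }
    return tags


header = [
    '##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, '
    'for each ALT allele, in the same order as listed">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">',
]
tags = parse_info_headers(header)
for tag_id, meta in tags.items():
    print(tag_id, meta["number"], meta["type"])
```

Annotation tags such as those proposed for UK10K would be declared with further lines of the same form and picked up in the same way.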


Storage Challenges

 Storage
      Try to reduce the proportion of raw data we keep (e.g. images, OQ in
       BAM, remove base calls in BAM etc.)
      However there’s still a LOT of data to store and analyse!
      Estimation for our group based on ~200Tbp of sequencing data over next
       2-3 years
             1.5 Pbytes
                      Permanent: Lane alignments, transposed BAMs, horizontal BAMs, bi-monthly
                       releases, backup of lane BAMs, Variant calls
                      Transient: Library BAMs, Local assemblies
      Storage type optimality criteria
             Cost per Tbyte
             Proximity to compute resources
             Scalability – room for expansion/future proofing
             I/O throughput
             Disaster recovery
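
To put the 1.5 Pbyte figure in perspective, it can be compared with the size of a single reduced-BAM copy of the data. This is only a rough sketch: it reuses the 0.98 bytes per bp quoted for NA19625 as if it applied to all data, and it assumes decimal units (1 Pbyte = 1e15 bytes).

```python
# Figures from the slides; decimal units (1 Pbyte = 1e15 bytes) are an assumption.
SEQUENCE_BP = 200e12          # ~200 Tbp over the next 2-3 years
STORAGE_BYTES = 1.5e15        # 1.5 Pbytes
BAM_BYTES_PER_BP = 0.98       # reduced BAM figure quoted for NA19625

implied_bytes_per_bp = STORAGE_BYTES / SEQUENCE_BP
single_copy_bytes = SEQUENCE_BP * BAM_BYTES_PER_BP
copies_equivalent = STORAGE_BYTES / single_copy_bytes

print(f"Implied storage per sequenced bp: {implied_bytes_per_bp:.1f} bytes")
print(f"One reduced-BAM copy of everything: {single_copy_bytes / 1e12:.0f} Tbytes")
print(f"Budget is equivalent to about {copies_equivalent:.1f} such copies")
```

The multiple reflects the list of permanent and transient items above: the same reads are held at lane level, in release BAMs, in backups and in intermediate files.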


A Tiered Solution

 3 tiered storage model (Levels 1-3), with local disk as Level 0
 Trade off cost, quantity, i/o throughput
 Similar to caching strategies in computer design
      Level 0: Local disk, closest proximity to CPU, intermediate temp files e.g.
       local assemblies, reference files
      Level 1: High-performance, highly parallel, close proximity to compute,
       expensive, suitable for high i/o tasks
      Level 2: Mid-tier storage, some type of nfs technology, discrete units with
       some local compute, suitable for low i/o tasks that are compute intensive,
       scalable by adding more discrete units
      Level 3: High latency storage, warehouse storage, not suitable to
       compute against, occasional access e.g. old data releases
             (Level 3a: Off-site replication of data in level 3)




A Tiered Solution

                                   Cost   Size   Throughput to CPU farm
  Level 1: High performance         2      1     3Gb/sec
  Level 2: Middle tier/nfs          1      2     800Mb/sec
  Level 3: Backup/warehouse         1      2
        Level 3a: Off-site replication

  Level 1
       Data: Current release horizontal + transposed BAMs
       Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
  Level 2
       Data: Lane level BAMs
       Processes: Alignment, recalibration, local realignment
  Level 3
       Data: Old release BAMs + variant calls backup
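
The assignment of data and processes to levels can be written down as a small lookup table, for example to let a pipeline decide where a job should read from. The tier contents follow the slide. The `Tier` class and `tier_for_process` helper are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    name: str
    data: tuple
    processes: tuple


# Contents follow the slide; the lookup helper is illustrative only.
TIERS = (
    Tier(1, "High performance",
         ("current release horizontal BAMs", "current release transposed BAMs"),
         ("BAM merging", "BAM splitting", "variant calling")),
    Tier(2, "Middle tier/nfs",
         ("lane level BAMs",),
         ("alignment", "recalibration", "local realignment")),
    Tier(3, "Backup/warehouse",
         ("old release BAMs", "variant calls backup"),
         ()),
)


def tier_for_process(process):
    """Return the tier a process should run against, or None if unlisted."""
    for tier in TIERS:
        if process in tier.processes:
            return tier
    return None


print(tier_for_process("variant calling").level)
print(tier_for_process("alignment").name)
```

Level 3 lists no processes, matching the point that warehouse storage is not suitable to compute against.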
Compute Challenges

 Compute
      New algorithms continually developed for more accurate variant calling
      In 2010, several new processes were added to the production pipeline
             BAM Improvement
                      Local realignment around indels to correct mapping biases (e.g. GATK)
                      Adding BAQ differences up front
             Indel calling by local assembly/alternative haplotype analysis (e.g. dindel)
             Local reassembly of SV breakpoints
      Easy to estimate runtime for known processes (e.g. mapping,
       recalibration, duplicate removal)
             Challenge to estimate runtime for next 2-3 years for new algorithms
             E.g. more use of assembly methods – more complex references?
 I/O has become a significant bottleneck and is the most difficult thing to
 measure
      All computations need to minimise I/O
             E.g. transforming BAM files to different sort orders
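
A crude way to see how much of a step is spent waiting on I/O is to compare wall-clock time with CPU time for the same read. The sketch below does this for a plain chunked file read. It is only an approximation: operating system caching can hide most of the cost on repeated runs, and the temporary file merely stands in for a BAM.

```python
import os
import tempfile
import time


def time_read(path, chunk_size=1 << 20):
    """Read `path` in chunks; return (bytes_read, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return total, wall, cpu


# Demo on a small temporary file standing in for a BAM.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    demo_path = tmp.name

nbytes, wall_s, cpu_s = time_read(demo_path)
os.remove(demo_path)
print(f"read {nbytes} bytes, wall {wall_s:.4f}s, cpu {cpu_s:.4f}s")
```

A large gap between wall and CPU time suggests the process was waiting on storage rather than computing.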

Project Data Release

 Do we need to release BAMs?
 Large scale human phenotype driven sequencing projects going
 forward
      Participants are more interested in the variants than the raw data
 BAM files may contain too much data and be too large to ship around
 amongst project members
 UK10K proposals
      Lane BAM files submitted to the archives
      BAM files not released via project ftp
      Project data release to consist solely of annotated VCF files
      Raw data can be obtained from the archives



