1. 1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale resequencing
Thomas Keane,
Vertebrate Resequencing Informatics,
Wellcome Trust Sanger Institute,
Cambridge,
UK
E: tk2@sanger.ac.uk
Vertebrate Resequencing Informatics 8th December, 2010
2. 1000G Update
Total number of base pairs: 23,416 GB
Aligned base pairs: 13,527 GB
Number of samples: 1103
Samples with > 10 GB raw sequence: 1078
Samples with > 10 GB aligned sequence: 718
Laura Clarke
4. UK10K
Large scale population/medical based sequencing project
UK10K project recently funded by WT
4,000 cohort samples genome wide @ 6x
Deeply phenotyped TwinsUK and ALSPAC cohorts
6,000 exomes from extreme samples
Protein coding exons from GenCode
Extreme end of traits of medical interest, and from collections of familial
cases
Accumulation of rare variants within genes or pathways
Utilise computational methods, data formats and workflows developed
during 1000 genomes project
Data release via EGA under access control
Estimating 100Tbp of raw sequence data
http://www.uk10k.org
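A rough back-of-envelope check of the 100Tbp estimate, as a sketch only. The genome size of 3.1 Gbp is an assumption not stated on the slide, and attributing the remainder of the 100Tbp to the exomes is a guess about how the total might divide, not a project figure.

```python
# Sketch: how the ~100 Tbp raw sequence estimate might break down.
GENOME_SIZE_BP = 3.1e9      # ASSUMPTION: approximate human genome size
cohort_samples = 4000       # from the slide: cohort samples
cohort_depth = 6            # from the slide: 6x genome wide
exome_samples = 6000        # from the slide: exomes
total_estimate_tbp = 100    # from the slide: estimated raw sequence

# Whole-genome cohort: samples x depth x genome size, in Tbp.
cohort_tbp = cohort_samples * cohort_depth * GENOME_SIZE_BP / 1e12

# Whatever is left of the estimate, if attributed entirely to exomes.
exome_tbp = total_estimate_tbp - cohort_tbp
per_exome_gbp = exome_tbp * 1e12 / exome_samples / 1e9

print(f"Cohort genomes: {cohort_tbp:.1f} Tbp")
print(f"Left for exomes: {exome_tbp:.1f} Tbp ({per_exome_gbp:.1f} Gbp each)")
```

Under these assumptions the low-coverage genomes account for roughly three quarters of the estimate.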
5. 1000G BAM File Evolutions
BAM
Until now BAMs included all raw data
Recently, tags have been removed
OQ: original qualities
Non-standard tags: XM, XG, XO
Also added BAQ differences to indicate non-confidently aligned bases
Space saving of 30%
E.g. NA19625: 1.45 vs 0.98 bytes per bp
Primary gain is from removal of original qualities
Further proposals
Replace base calls with '=' sign to indicate agreement with reference
Rejected due to lack of tool support
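The NA19625 figures can be checked against the quoted 30% saving with a few lines of arithmetic. This is only a sketch: the variable and function names are illustrative, and the two bytes-per-bp values are the ones given above.

```python
# Bytes-per-base figures quoted for NA19625.
before = 1.45  # bytes per bp with all raw data kept in the BAM
after = 0.98   # bytes per bp after tag removal


def fractional_saving(before, after):
    """Fraction of space saved when going from `before` to `after`."""
    return (before - after) / before


saving = fractional_saving(before, after)
print(f"Space saving: {saving:.1%}")
```

For this sample the saving comes out slightly above 30%.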
6. Population/Transposed BAM
Traditionally BAM files have been produced per sample with all of the
lanes/libraries merged
Lanes -> Library -> Platform -> Sample (1 per individual)
Problem: population based SNP calling needs to be aware of the
reads across multiple samples at same loci
Problems with opening hundreds/thousands of file handles
simultaneously
Distributed/parallel file systems perform best when reading a few large striped files
Solution: Transposed BAMs
Genome slices with multiple samples within single BAM
E.g. entire CEU population
Header information to separate read groups into samples
Samtools mpileup, GATK etc support this functionality
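A minimal sketch of how header information can separate read groups into samples in a multi-sample BAM. It works on SAM header text only, using the standard @RG fields ID and SM. The sample and lane names in the example header are made up for illustration, and real tools such as samtools or GATK do this internally rather than through this function.

```python
def read_group_to_sample(header_text):
    """Map each @RG ID to its SM (sample) value from SAM header text."""
    mapping = {}
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        mapping[fields["ID"]] = fields["SM"]
    return mapping


# Hypothetical header: three lanes belonging to two samples.
example_header = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@RG\tID:lane1\tSM:sampleA\tLB:libA1\tPL:ILLUMINA",
    "@RG\tID:lane2\tSM:sampleA\tLB:libA2\tPL:ILLUMINA",
    "@RG\tID:lane3\tSM:sampleB\tLB:libB1\tPL:ILLUMINA",
])

rg_to_sample = read_group_to_sample(example_header)
samples = sorted(set(rg_to_sample.values()))
print(rg_to_sample)
print(samples)
```

Each read carries an RG tag naming its read group, so a caller reading one transposed BAM can assign every read to a sample through this mapping instead of opening one file per sample.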
8. VCF Format
Fully adopted by 1000G group as interchange format for variant calls
SNPs, indels, and recently SVs
Genotyping calls for all samples
Annotation of variants via user-defined tags
VCF APIs and tools via http://vcftools.sourceforge.net
Scaling issues with VCF – BCF format in development
Petr Danecek
9. VCF (useful) Bloat
Every release of 1000G adds more tags to VCF files
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
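The ##INFO lines above share a fixed layout (ID, Number, Type, Description), so they can be read into a lookup table with a small parser. This is a sketch using only the Python standard library, not the vcftools API. It assumes the four keys appear in exactly this order, as they do in the lines shown.

```python
import re

# Matches the ##INFO layout used above. The description is matched last
# because it may itself contain commas.
INFO_RE = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<description>.*)">$'
)


def parse_info_header(line):
    """Return a dict with id, number, type and description for one ##INFO line."""
    match = INFO_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an ##INFO header line: {line!r}")
    return match.groupdict()


header_lines = [
    '##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in '
    'genotypes, for each ALT allele, in the same order as listed">',
    '##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples '
    'downsampled?">',
]

# Index the parsed definitions by tag ID.
tags = {d["id"]: d for d in map(parse_info_header, header_lines)}
print(tags["AC"]["type"], tags["DS"]["type"])
```

Every tag added in a release becomes one more entry in this table, which is one way to track how the annotation set grows.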
UK10K propose rich annotation of VCF files
Known SNPs/indels
RS IDs, G1K unaccessioned SNPs
Geographical information
Ensembl annotation (coding, exonic, intronic, UTR, splice..)
microRNA, eQTL, known disease loci
Coding consequences
Synonymous/non-synonymous, splice, stop, GERP score
Functional interpretation
Polyphen, Sift, PANTHER
10. Storage Challenges
Storage
Try to reduce the proportion of raw data we keep (e.g. images, OQ in
BAM, remove base calls in BAM etc.)
However there’s still a LOT of data to store and analyse!
Estimation for our group based on ~200Tbp of sequencing data over next
2-3 years
1.5 Pbytes
Permanent: Lane alignments, transposed BAMs, horizontal BAMs, bi-monthly
releases, backup of lane BAMs, Variant calls
Transient: Library BAMs, Local assemblies
Storage type optimality criteria
Cost per Tbyte
Proximity to compute resources
Scalability – room for expansion/future proofing
I/O throughput
Disaster recovery
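The storage estimate implies an overall storage footprint per sequenced base, which can be compared with the per-BAM figure quoted earlier (0.98 bytes per bp after tag removal). The calculation below is a sketch only. It assumes decimal units (1 Tbp = 10^12 bp, 1 Pbyte = 10^15 bytes), and the "BAM-equivalent copies" ratio is an illustrative reading, not a figure from the estimate itself.

```python
# Figures from the estimate above.
sequencing_bp = 200e12       # ~200 Tbp of sequencing data
storage_bytes = 1.5e15       # 1.5 Pbytes (ASSUMPTION: decimal units)

# Per-BAM figure quoted earlier for NA19625 after tag removal.
bam_bytes_per_bp = 0.98

# Overall bytes of storage needed per sequenced base.
bytes_per_bp = storage_bytes / sequencing_bp

# Roughly how many reduced-BAM-sized copies of the data that corresponds to.
bam_equivalent_copies = bytes_per_bp / bam_bytes_per_bp

print(f"Storage per sequenced base: {bytes_per_bp:.1f} bytes")
print(f"BAM-equivalent copies: {bam_equivalent_copies:.1f}")
```

A ratio well above one is consistent with keeping the same reads in several forms (lane alignments, transposed BAMs, horizontal BAMs, releases, backups).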
11. A Tiered Solution
3 tiered storage model
Trade off cost, quantity, i/o throughput
Similar to caching strategies in computer design
Level 0: Local disk, closest proximity to CPU, intermediate temp files e.g.
local assemblies, reference files
Level 1: High-performance, highly parallel, close proximity to compute,
expensive, suitable for high i/o tasks
Level 2: Mid-tier storage, some type of nfs technology, discrete units with
some local compute, suitable for low i/o tasks that are compute intensive,
scalable by adding more discrete units
Level 3: High latency storage, warehouse storage, not suitable to
compute against, occasional access e.g. old data releases
(Level 3a: Off-site replication of data in level 3)
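One way to picture the tiers is as a simple placement rule driven by how data is accessed. The sketch below is hypothetical throughout: the Workload fields and the choose_tier rules are an illustrative reading of the level descriptions above, not an actual placement policy.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a dataset's access pattern."""
    name: str
    temporary: bool          # intermediate files discarded after the job
    high_io: bool            # many reads/writes relative to compute
    occasional_access: bool  # rarely read, not computed against


def choose_tier(workload):
    """Pick a storage level (0-3) following the descriptions above."""
    if workload.temporary:
        return 0  # local disk, closest to the CPU
    if workload.occasional_access:
        return 3  # high-latency warehouse storage
    if workload.high_io:
        return 1  # high-performance parallel storage
    return 2      # mid-tier storage for low-I/O, compute-heavy tasks


examples = [
    Workload("local assemblies", temporary=True, high_io=True, occasional_access=False),
    Workload("high i/o task", temporary=False, high_io=True, occasional_access=False),
    Workload("compute-intensive, low i/o task", temporary=False, high_io=False, occasional_access=False),
    Workload("old data releases", temporary=False, high_io=False, occasional_access=True),
]
tiers = {w.name: choose_tier(w) for w in examples}
print(tiers)
```

The order of the checks matters: temporary files stay on local disk even when they are I/O heavy, in the same way a cache keeps the hottest data closest to the CPU.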
13. Compute Challenges
Compute
New algorithms continually developed for more accurate variant calling
In 2010, several new processes were added to the production pipeline
BAM Improvement
Local realignment around indels to correct mapping biases (e.g. GATK)
Adding BAQ differences up front
Indel calling by local assembly/alternative haplotype analysis (e.g. dindel)
Local reassembly of SV breakpoints
Easy to estimate runtime for known processes (e.g. mapping,
recalibration, duplicate removal)
Challenge to estimate runtime for next 2-3 years for new algorithms
E.g. more use of assembly methods – more complex references?
I/O has become a significant bottleneck and is the most difficult thing to
measure
All computations need to minimise I/O
E.g. transforming BAM files to different sort orders
14. Project Data Release
Do we need to release BAMs?
Large scale human phenotype driven sequencing projects going
forward
Participants are more interested in the variants than the raw data
BAM files may contain too much data and be too large to ship around
amongst project members
UK10K proposals
Lane BAM files submitted to the archives
BAM files not released via the project FTP
Project data release to consist solely of annotated VCF files
Raw data can be obtained from the archives