1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale resequencing

Thomas Keane,
Vertebrate Resequencing Informatics,
Wellcome Trust Sanger Institute,
Cambridge,
UK
E: tk2@sanger.ac.uk

Vertebrate Resequencing Informatics 8th December, 2010

1000G Update

Total number of base pairs                23,416 Gbp
Aligned base pairs                        13,527 Gbp
Number of samples                         1,103
Samples with > 10 Gbp raw sequence        1,078
Samples with > 10 Gbp aligned sequence    718

Laura Clarke


1000G update – Raw Sequence Growth

[Figure: raw sequence growth over time by population (CEU, YRI, JPT, TSI, CHB, ASW, LWK, MXL, GBR, CHS, FIN, PUR, CLM, IBS). Y-axis runs from 0 to 25,000; x-axis is date.]
Laura Clarke


UK10K

Large scale population/medical based sequencing project
UK10K project recently funded by WT
  4,000 cohort samples genome wide @ 6x
 Deeply phenotyped TwinsUK and ALSPAC cohorts
  6,000 exomes from extreme samples
 Protein coding exons from GenCode
 Extreme end of traits of medical interest, and from collections of familial
cases
 Accumulation of rare variants within genes or pathways
  Utilise computational methods, data formats and workflows developed
during 1000 genomes project
  Data release via EGA under access control
  Estimating 100Tbp of raw sequence data
  http://www.uk10k.org


1000G BAM File Evolutions

BAM
  Until now, BAMs included all raw data
  Recently, tags were removed
 OQ: original qualities
 Non-standard tags: XM, XG, XO
  Also added BAQ differences to indicate non-confidently aligned bases
  Space saving of about 30%
 E.g. NA19625: 1.45 vs 0.98 bytes per bp
 Primary gain is from removal of original qualities
Further proposals
  Replace base calls with '=' sign to indicate agreement with reference
  Rejected due to lack of tool support
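
The space saving quoted for NA19625 can be checked directly from the two bytes-per-bp figures on the slide. A minimal sketch follows; the function name `space_saving` is illustrative and not part of any 1000G tooling.

```python
def space_saving(before_bytes_per_bp: float, after_bytes_per_bp: float) -> float:
    """Return the fractional reduction in storage per base pair."""
    return (before_bytes_per_bp - after_bytes_per_bp) / before_bytes_per_bp


# Figures quoted on the slide for NA19625 (before vs after tag removal).
saving = space_saving(1.45, 0.98)
print(f"{saving:.1%}")  # roughly a third, consistent with the ~30% quoted
```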


Population/Transposed BAM

Traditionally BAM files have been produced per sample with all of the
lanes/libraries merged
  Lanes -> Library -> Platform -> Sample (1 per individual)
Problem: population-based SNP calling needs to be aware of the
reads across multiple samples at the same loci
  Problems with opening hundreds/thousands of file handles
simultaneously
  Distributed/parallel file systems prefer reading a few large striped files
Solution: Transposed BAMs
  Genome slices with multiple samples within single BAM
 E.g. entire CEU population
  Header information to separate read groups into samples
 Samtools mpileup, GATK etc support this functionality
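
To illustrate how header information separates read groups into samples, here is a minimal sketch that parses `@RG` lines from SAM/BAM header text and groups read group IDs by their `SM` (sample) field. The read group IDs and library names in the example header are hypothetical; only the sample names appear elsewhere in this talk. Real pipelines would normally read the header through samtools or a BAM library rather than raw text.

```python
from collections import defaultdict
from typing import Dict, List


def read_groups_by_sample(header_text: str) -> Dict[str, List[str]]:
    """Map each sample (SM) to the read group IDs (ID) declared in @RG lines."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        if "ID" in fields and "SM" in fields:
            groups[fields["SM"]].append(fields["ID"])
    return dict(groups)


# Hypothetical header for a transposed BAM holding two samples.
example_header = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@RG\tID:lane_a\tSM:NA19294\tLB:lib1\tPL:ILLUMINA",
    "@RG\tID:lane_b\tSM:NA19294\tLB:lib1\tPL:ILLUMINA",
    "@RG\tID:lane_c\tSM:NA18943\tLB:lib2\tPL:ILLUMINA",
])

samples = read_groups_by_sample(example_header)
print(samples)
```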


Horizontal/Transposed BAM

[Diagram: one row per sample (NA19294, NA18943, ..., NA19305), each row a horizontal BAM spanning Chr1, Chr2, ...; each column, one genome slice across all samples, is a transposed BAM.]

Key questions
  Slice size – chromosome? 1Mbp, 10Mbp or 100Mbp?
  Size of individual groupings – 10, 50, 100, 500 individuals?
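
The trade-off between slice size and group size can be explored with a small sketch that enumerates fixed-size slices and counts the resulting files. The chromosome length below is a made-up round number for illustration, and the function names are hypothetical; the sample count of 1103 is taken from the 1000G update slide.

```python
import math
from typing import Iterator


def genome_slices(chrom: str, length: int, slice_size: int) -> Iterator[str]:
    """Yield 1-based, inclusive regions covering a chromosome in fixed-size slices."""
    for start in range(1, length + 1, slice_size):
        end = min(start + slice_size - 1, length)
        yield f"{chrom}:{start}-{end}"


def transposed_file_count(n_individuals: int, group_size: int, n_slices: int) -> int:
    """Number of transposed BAMs: one per (group of individuals, slice)."""
    return math.ceil(n_individuals / group_size) * n_slices


# Hypothetical 25 Mbp chromosome cut into 10 Mbp slices.
slices = list(genome_slices("1", 25_000_000, 10_000_000))
print(slices)

# 1103 samples in groups of 100, for the three slices above.
print(transposed_file_count(1103, 100, len(slices)))
```

Smaller slices and smaller groups give more, smaller files; larger ones give fewer, larger files that suit striped file systems but are coarser units of parallel work.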


VCF Format

Fully adopted by 1000G group as interchange format for variant calls
  SNPs, indels, and recently SVs
  Genotyping calls for all samples
  Annotation of variants via user-defined tags
  VCF APIs and tools via http://vcftools.sourceforge.net
  Scaling issues with VCF – BCF format in development

Petr Danecek

VCF (useful) Bloat

Every release of 1000G adds more tags to VCF files
  ##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
  ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
  ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
  ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
  ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
  ##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
  ##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
  ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
  ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
  ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
  ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
  ##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
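
As a rough illustration of how these tags can be inventoried, the sketch below parses `##INFO` header lines of the form shown above into dictionaries. It is a minimal regular-expression approach that assumes the four fields appear in the order ID, Number, Type, Description; it is not the VCFtools API.

```python
import re
from typing import Dict, Optional

# Assumes the field order ID, Number, Type, Description, as in the lines above.
INFO_RE = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<description>.*)">$'
)


def parse_info_header(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of one ##INFO header line, or None if it does not match."""
    match = INFO_RE.match(line.strip())
    return match.groupdict() if match else None


ac = parse_info_header(
    '##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, '
    'for each ALT allele, in the same order as listed">'
)
print(ac["id"], ac["number"], ac["type"])
```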

UK10K propose rich annotation of VCF files
  Known SNPs/indels
 RS IDs, G1K unaccessioned SNPs
  Geographical information
 Ensembl annotation (coding, exonic, intronic, UTR, splice..)
 microRNA, eQTL, known disease loci
  Coding consequences
 Synonymous/non-synonymous, splice, stop, GERP score
  Functional interpretation
 Polyphen, Sift, PANTHER


Storage Challenges

Storage
  Try to reduce the proportion of raw data we keep (e.g. images, OQ in
BAM, remove base calls in BAM etc.)
  However there’s still a LOT of data to store and analyse!
  Estimate for our group, based on ~200Tbp of sequencing data over the next
2-3 years:
 1.5 Pbytes
  Permanent: Lane alignments, transposed BAMs, horizontal BAMs, bi-monthly
releases, backup of lane BAMs, Variant calls
  Transient: Library BAMs, Local assemblies
  Storage type optimality criteria
 Cost per Tbyte
 Proximity to compute resources
 Scalability – room for expansion/future proofing
 I/O throughput
 Disaster recovery


A Tiered Solution

3-tiered storage model
Trade off cost, quantity, I/O throughput
Similar to caching strategies in computer design
  Level 0: Local disk, closest proximity to CPU, intermediate temp files e.g.
local assemblies, reference files
  Level 1: High-performance, highly parallel, close proximity to compute,
expensive, suitable for high I/O tasks
  Level 2: Mid-tier storage, some type of NFS technology, discrete units with
some local compute, suitable for low I/O tasks that are compute intensive,
scalable by adding more discrete units
  Level 3: High latency storage, warehouse storage, not suitable to
compute against, occasional access e.g. old data releases
 (Level 3a: Off-site replication of data in level 3)


A Tiered Solution

[Diagram: CPU farm connected to Level 1 (high performance) at 3Gb/sec and to Level 2 (middle tier/NFS) at 800Mb/sec; Level 3 (backup/warehouse) with Level 3a off-site replication. Cost and Size columns: Level 1: 2, 1; Level 2: 1, 2; Level 3: 1, 2.]

Level 1
  Data: Current release horizontal + transposed BAMs
  Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
Level 2
  Data: Lane level BAMs
  Processes: Alignment, recalibration, local realignment
Level 3
  Data: Old release BAMs + variant calls backup
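
The placement above can be written down as a small lookup table, which is one way a pipeline could decide where a job should read and write. This is a sketch only: the dictionary layout and the function name `level_for_process` are hypothetical, while the entries themselves are taken from the lists above.

```python
from typing import Optional

# Data products and processes per storage level, as listed on the slide.
TIER_PLAN = {
    1: {
        "data": ["current release horizontal BAMs", "current release transposed BAMs"],
        "processes": ["BAM merging", "BAM splitting", "variant calling"],
    },
    2: {
        "data": ["lane level BAMs"],
        "processes": ["alignment", "recalibration", "local realignment"],
    },
    3: {
        "data": ["old release BAMs", "variant calls backup"],
        "processes": [],  # Level 3 is not suitable to compute against
    },
}


def level_for_process(process: str) -> Optional[int]:
    """Return the storage level a process runs against, or None if it is not listed."""
    for level, plan in TIER_PLAN.items():
        if process in plan["processes"]:
            return level
    return None


print(level_for_process("alignment"))
print(level_for_process("variant calling"))
```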

Compute Challenges

Compute
  New algorithms continually developed for more accurate variant calling
  In 2010, several new processes were added to the production pipeline
 BAM Improvement
  Local realignment around indels to correct mapping biases (e.g. GATK)
  Adding BAQ differences up front
 Indel calling by local assembly/alternative haplotype analysis (e.g. dindel)
 Local reassembly of SV breakpoints
  Easy to estimate runtime for known processes (e.g. mapping,
recalibration, duplicate removal)
 Challenge to estimate runtime for next 2-3 years for new algorithms
 E.g. more use of assembly methods – more complex references?
I/O has become a significant bottleneck and is the most difficult thing to
measure
  All computations need to minimise I/O
 E.g. transforming BAM files to different sort orders


Project Data Release

Do we need to release BAMs?
Large scale human phenotype driven sequencing projects going
forward
  Participants are more interested in the variants than the raw data
BAM files may contain too much data and be too large to ship around
among project members
UK10K proposals
  Lane BAM files submitted to the archives
  Do not release BAM files via the project FTP site
  Project data releases consist solely of annotated VCF files
  Raw data can be obtained from the archives

