3. Why change CRAM?
●
CRAM is mature. A GA4GH standard
– 1.0 (2012), 2.0 (2013), 2.1 (2014), 3.0 (2014)
– Java (htsjdk), C (Scramble, Htslib), JavaScript (JBrowse)
– Good speed vs size vs random-access tradeoff.
●
But data has changed.
– e.g. Illumina 4 and 8 quality binning (down from 40+).
●
Broader goals
– Long term archival (higher CPU for smaller files).
– More willing to consider lossy compression.
●
Keep same fundamental format.
– New codecs only (CRAM 3.1); ease of adoption.
7. CRAM file structure
[Figure: a CRAM file is a series of containers. Each container has a header and slices; each slice has a slice header followed by blocks, and each block holds one data series.]
Example data series: RN (read name), QS (quality scores), BF (BAM flags), BC:Z (aux. field).
Note: Sequence is optionally delta vs an
external or embedded reference sequence.
Via GA4GH refget
http://samtools.github.io/hts-specs/refget.html
8. Selection by region
[Figure: the container / slice / block layout from slide 7, with data series RN, QS, BF, BC:Z.]
E.g. samtools view chr1:10,000,000-11,000,000
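A region query can be answered by consulting an index and reading only the slices that overlap the request. The sketch below illustrates that idea only; `IndexEntry`, `slices_for_region` and the sample offsets are hypothetical and are not the htslib API or the real index layout.

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    ref: str     # reference name
    start: int   # 1-based alignment start of the slice
    span: int    # number of reference bases the slice covers
    offset: int  # byte offset of the owning container (made-up values below)

def slices_for_region(index, ref, beg, end):
    """Return the entries whose [start, start+span) overlaps [beg, end]."""
    return [e for e in index
            if e.ref == ref and e.start <= end and e.start + e.span > beg]

index = [
    IndexEntry("chr1", 9_000_001, 500_000, 1_000),
    IndexEntry("chr1", 9_900_001, 500_000, 2_000),
    IndexEntry("chr1", 10_400_001, 700_000, 3_000),
    IndexEntry("chr1", 11_100_001, 500_000, 4_000),
    IndexEntry("chr2", 10_000_001, 500_000, 5_000),
]

# Only the slices overlapping chr1:10,000,000-11,000,000 need to be read.
hits = slices_for_region(index, "chr1", 10_000_000, 11_000_000)
hit_offsets = [e.offset for e in hits]
```

Everything outside the selected byte ranges can be skipped without decoding.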
9. Selection by data series
[Figure: the container / slice / block layout from slide 7, with data series RN, QS, BF, BC:Z.]
E.g. samtools flagstat
or cram_filter
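Because each data series sits in its own block, a consumer that needs only some series can skip the rest without decompressing them. A minimal sketch, assuming hypothetical `Block` and `Slice` types (this is not how cram_filter is implemented):

```python
from dataclasses import dataclass

@dataclass
class Block:
    series: str     # e.g. "RN", "QS", "BF", "BC:Z"
    payload: bytes  # compressed bytes; never decoded in this sketch

@dataclass
class Slice:
    blocks: list

def keep_series(slices, wanted):
    """Copy slices, keeping only blocks whose data series is in `wanted`."""
    return [Slice([b for b in s.blocks if b.series in wanted]) for s in slices]

slices = [
    Slice([Block("RN", b"names0"), Block("QS", b"quals0"),
           Block("BF", b"flags0"), Block("BC:Z", b"aux0")]),
    Slice([Block("RN", b"names1"), Block("QS", b"quals1"),
           Block("BF", b"flags1"), Block("BC:Z", b"aux1")]),
]

# A flags-only consumer touches a small fraction of the bytes.
flags_only = keep_series(slices, {"BF"})
kept_bytes = sum(len(b.payload) for s in flags_only for b in s.blocks)
```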
10. Transport Format: GA4GH Htsget
●
Defined API for querying subsets.
– By region
●
Fast block stitching, no need to decode data.
– By type.
●
Can transparently “drop” unneeded data-series.
●
JSON response refers to multiple HTTPS streams.
– Permits distributed data.
– Retry of individual streams if one fails.
●
Union of streams is valid BAM or CRAM
– (Add new header & footer to containers)
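The stitching step can be sketched as follows: parse the JSON response, fetch each listed stream in order (retrying an individual stream if it fails), and concatenate the bytes. The `fetch` callable, the example URLs and the retry count are assumptions for illustration; a real client would issue HTTPS requests with the given headers.

```python
import json

def stitch(ticket_json, fetch, retries=3):
    """Concatenate the byte streams listed in an htsget-style response."""
    ticket = json.loads(ticket_json)["htsget"]
    out = bytearray()
    for part in ticket["urls"]:
        for attempt in range(retries):
            try:
                out += fetch(part["url"], part.get("headers", {}))
                break                      # this stream succeeded
            except OSError:
                if attempt == retries - 1:
                    raise                  # give up on this stream
    return bytes(out)

# Hypothetical response and a fake fetcher that fails once on one stream.
ticket = json.dumps({"htsget": {"format": "CRAM", "urls": [
    {"url": "https://example.org/header"},
    {"url": "https://example.org/body", "headers": {"Range": "bytes=100-199"}},
    {"url": "https://example.org/eof"},
]}})

calls = {}

def flaky_fetch(url, headers):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("body") and calls[url] == 1:
        raise OSError("simulated failure")
    return url.rsplit("/", 1)[1].encode()

data = stitch(ticket, flaky_fetch)
```

Only the failed stream is retried; streams already fetched are not requested again.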
Kelleher, Jerome, et al. "htsget: a protocol for securely streaming
genomic data." Bioinformatics 35.1 (2018): 119-121.
11. Crypt4GH
●
Format agnostic encryption (BAM, CRAM, VCF...)
– Random access
– Multiple keys in same file (users).
– Can limit keys to regions
●
Fast re-encryption
– Rewrite new header only (with user’s public key).
– (Link between encrypted archive, Htsget, and users)
●
Review spec:
– https://www.ga4gh.org/wp-content/uploads/crypt4gh.pdf
12. CRAM file structure
[Figure: the container / slice / block layout from slide 7. Each data series has its own block and may use a different codec.]
Data series: RN     QS       BF     BC:Z
Codec:       Bzip2  Fqzcomp  rANS1  Gzip
Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM Format Sequencing Data. PLoS ONE 8(3): e59190 (Fqzcomp)
Duda, J. Asymmetric numeral systems. arXiv:0902.0271 [cs.IT] (ANS)
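Choosing a codec per data series can be sketched as "try each candidate, keep the smallest". The sketch uses only Python standard-library codecs as stand-ins (fqzcomp and rANS are not in the standard library), and the sample data are invented.

```python
import bz2
import lzma
import zlib

# Stand-in codecs; CRAM's own codec set differs.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def best_codec(data):
    """Return (codec name, compressed bytes) for the smallest result."""
    results = {name: enc(data) for name, (enc, _) in CODECS.items()}
    name = min(results, key=lambda n: len(results[n]))
    return name, results[name]

series = {
    "RN": b"".join(b"VP2-06:112:H7LNDMCVY:1:%d:%d\n" % (1100 + i, 7 * i)
                   for i in range(500)),
    "QS": bytes([37, 37, 25, 37, 11, 37, 37, 25] * 500),
    "BF": bytes([99, 147, 83, 163] * 500),
}

# Each data series is compressed independently with its own best codec.
chosen = {key: best_codec(data) for key, data in series.items()}
```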
13. Compression Basics
[Figure: two bar charts of letter counts (a to z): "All Symbols (English Text)" and "Last letter = a (Order-1 entropy)".]
Model
Context
14. Compression Basics
[Figure: two bar charts of letter counts (a to z): "All Symbols (English Text)" and "Last letter = q (Order-1 entropy)".]
Model
Context (index to array of models)
More predictable.
⇒ Better compression.
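The effect of a context can be checked numerically: order-0 entropy uses one model for all symbols, while order-1 entropy uses the previous symbol as an index into an array of models. A small sketch (the sample text is arbitrary):

```python
import math
from collections import Counter, defaultdict

def order0_entropy(text):
    """Bits per symbol using a single model for all symbols."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def order1_entropy(text):
    """Bits per symbol when the previous symbol selects the model."""
    models = defaultdict(Counter)          # context -> symbol counts
    for prev, cur in zip(text, text[1:]):
        models[prev][cur] += 1
    total = 0.0
    for counts in models.values():
        m = sum(counts.values())
        total -= sum(c * math.log2(c / m) for c in counts.values())
    return total / (len(text) - 1)

text = "the quick queen quietly quit the quiz " * 5
h0 = order0_entropy(text)
h1 = order1_entropy(text)   # lower: e.g. "q" is always followed by "u" here
```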
15. Quality Values
●
Illumina quality values are cycle and read1 /
read2 specific. (Heatmap from recent MiSeq)
[Figure: two heatmaps of quality against cycle number, one for Read 1 and one for Read 2.]
16. FQZComp
●
Parameterised version of fqzcomp (2011).
●
Context reset per read; adaptive:
– Previous quality values
– Approximate position within read
– “Smoothness”
●
+ external selector (stored into data-stream):
– Read1 / Read2 (from BAM flags)
– Tile, X, Y coord (from read name)
– Read-group
●
Decoder just reads the selector value; it needs no understanding of what it means.
– 15-35% saving over rANS1
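To illustrate why such contexts help, the sketch below estimates the coded size of quality values under an adaptive frequency model, with and without a context built from the previous quality, an approximate position and an external selector. It is a toy cost estimate, not the fqzcomp codec; the bucket size, the 64-symbol alphabet and the synthetic reads are all assumptions.

```python
import math
from collections import defaultdict

def adaptive_cost_bits(reads, selectors, nsym=64, use_context=True):
    """Estimated coded size in bits under an adaptive frequency model."""
    models = defaultdict(lambda: [1] * nsym)   # context -> symbol counts
    bits = 0.0
    for quals, sel in zip(reads, selectors):
        prev = 0                               # context reset per read
        for pos, q in enumerate(quals):
            ctx = (prev, pos // 8, sel) if use_context else ()
            counts = models[ctx]
            bits -= math.log2(counts[q] / sum(counts))
            counts[q] += 1                     # adapt after coding the symbol
            prev = q
    return bits

# Synthetic data: "read 1" is all Q37, "read 2" is all Q20.
reads = [[37] * 30 if i % 2 == 0 else [20] * 30 for i in range(400)]
selectors = [i % 2 for i in range(400)]

with_ctx = adaptive_cost_bits(reads, selectors, use_context=True)
without_ctx = adaptive_cost_bits(reads, selectors, use_context=False)
```

The selector is just a number mixed into the context; the model never needs to know that it came from a flag or a read name.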
17. ID (name) compression
●
Pick a previous ID to compare against.
DIFF -2
●
Tokenise and compare to previous tokens.
MATCH (STRING “VP2-06:112:H7LNDMCVY:1:”)
DELTA +93 (1251-1158)
MATCH (CHAR “:”)
DIGITS 6253
●
Compress each token series individually.
●
Also adopted by MPEG-G.
VP2-06:112:H7LNDMCVY:1:1124:21694:10473
VP2-06:112:H7LNDMCVY:1:1158:23665:6370
HS25_09827:2:2208:9732:56894#49
VP2-06:112:H7LNDMCVY:1:1251:6253:36119
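The tokenise-and-compare step can be sketched as below. This is a simplified illustration, not the CRAM 3.1 name tokeniser: the token types are reduced to MATCH, DELTA, DIGITS and STRING, the split is finer than on the slide (every run of digits or non-digits is its own token), and the delta range of 0 to 255 is an assumption.

```python
import re

def tokenise(name):
    """Split a name into runs of digits and runs of non-digits."""
    return re.findall(r"[0-9]+|[^0-9]+", name)

def is_number(tok):
    # Numbers with leading zeros are kept as strings so they round-trip.
    return tok.isascii() and tok.isdigit() and tok[0] != "0"

def encode(name, prev):
    """Describe `name` token by token relative to the earlier name `prev`."""
    prev_toks = tokenise(prev)
    out = []
    for i, tok in enumerate(tokenise(name)):
        old = prev_toks[i] if i < len(prev_toks) else None
        if tok == old:
            out.append(("MATCH",))
        elif (is_number(tok) and old is not None and is_number(old)
              and 0 <= int(tok) - int(old) < 256):
            out.append(("DELTA", int(tok) - int(old)))
        elif is_number(tok):
            out.append(("DIGITS", int(tok)))
        else:
            out.append(("STRING", tok))
    return out

names = [
    "VP2-06:112:H7LNDMCVY:1:1124:21694:10473",
    "VP2-06:112:H7LNDMCVY:1:1158:23665:6370",
    "HS25_09827:2:2208:9732:56894#49",
    "VP2-06:112:H7LNDMCVY:1:1251:6253:36119",
]

# DIFF -2: compare the last name against the one two places back.
tokens = encode(names[3], names[3 - 2])
```

Each token position then forms its own series (all the deltas at position 11, all the digits at position 13, and so on), which is what gets compressed.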
20. CRAM breakdown by data series size (BQSR modified Illumina qualities)
Data: SynDip
[Figure: chart of CRAM file size split by data series: Quality, Names, Aux. tags, Sequence.]
Lossy Compression
●
Thought experiment:
– Call VCF with all qualities as-is.
– Set all quality to fixed value (e.g. 30) and call again.
– Matching calls: discard qualities.
Mismatching calls: keep qualities
●
Crumble:
– Use (hom/het) consensus quality.
– ASSUMPTION: single diploid genome
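The thought experiment above amounts to comparing two call sets and keeping qualities only where they disagree. A minimal sketch, with hypothetical call dictionaries keyed by (chromosome, position); it illustrates the thought experiment only, not how Crumble decides what to keep.

```python
def sites_needing_quality(calls_orig, calls_fixed):
    """Sites whose call changes when all qualities are set to a fixed value."""
    sites = set(calls_orig) | set(calls_fixed)
    return sorted(s for s in sites if calls_orig.get(s) != calls_fixed.get(s))

# Hypothetical genotype calls with original vs fixed (e.g. 30) qualities.
calls_orig = {("chr1", 100): "C/T", ("chr1", 250): "A/A", ("chr1", 400): "A/*"}
calls_fixed = {("chr1", 100): "C/T", ("chr1", 250): "A/G", ("chr1", 500): "G/G"}

keep = sites_needing_quality(calls_orig, calls_fixed)
# Qualities can be discarded everywhere except around the sites in `keep`.
```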
21. Poly-A poorly aligned. Should have deletion.
Most likely call is A/* het, but not confident.
High qual C/T het.
Deletion
22. High quality, but wrong column. => looks like rare allele
Low quality discrepant bases
High quality, in expected ratios
23. Not confident. Keep for entire poly-A, plus couple either side
Confidently erroneous
Confident CT het
30. What else to discard?
●
Read-names
– If all alignments for template in same CRAM slice.
– Only appropriate after optical deduplication steps.
●
Auxiliary tags
– Whitelist / blacklist options
– Some tags are HUGE and largely pointless, e.g. GATK “OQ” (original quality).
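Whitelisting or blacklisting auxiliary tags can be sketched on a SAM text record, where the optional tags follow the 11 mandatory fields. The function name and the example record are hypothetical.

```python
def filter_tags(sam_line, keep=None, drop=None):
    """Keep only tags in `keep` (whitelist) or remove tags in `drop` (blacklist)."""
    fields = sam_line.rstrip("\n").split("\t")
    core, tags = fields[:11], fields[11:]   # 11 mandatory SAM fields, then tags

    def wanted(tag):
        name = tag[:2]                      # tags look like "OQ:Z:value"
        if keep is not None:
            return name in keep
        return drop is None or name not in drop

    return "\t".join(core + [t for t in tags if wanted(t)])

record = "\t".join([
    "read1", "99", "chr1", "100", "60", "4M", "=", "300", "204",
    "ACGT", "IIII", "RG:Z:grp1", "OQ:Z:FFFF", "NM:i:0",
])

no_oq = filter_tags(record, drop={"OQ"})    # blacklist the original qualities
rg_only = filter_tags(record, keep={"RG"})  # whitelist the read group only
```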
34. Acknowledgements
●
EBI: Vadim Zalunin (Java CRAM)
– Markus Hsi-Yang Fritz et al. (2011); Efficient storage of high throughput DNA sequencing data using reference-based compression, Genome Research, Vol 21, Issue 5
●
Sanger: Rob Davies, David Jackson (Htslib / Samtools)
●
Sanger: Richard Durbin, Shane McCarthy
– James K Bonfield et al. (2019); Crumble: reference free lossy compression of sequence quality values, Bioinformatics, Vol 35, Issue 2
●
GitHub
– https://github.com/jkbonfield/crumble
– https://github.com/jkbonfield/io_lib
– https://github.com/jkbonfield/htscodecs
– https://github.com/samtools/hts-specs
– https://www.ga4gh.org/wp-content/uploads/crypt4gh.pdf