CRAM 3.1 and Crumble
James Bonfield
Wellcome Sanger Institute
Overview
● Why?
● CRAM structure.
● New lossless codecs
– Compression 101
– CRAM 3.1
● Lossy compression
– Thought experiment
– Crumble
Why change CRAM?
● CRAM is mature. A GA4GH standard
– 1.0 (2012), 2.0 (2013), 2.1 (2014), 3.0 (2014)
– Java (htsjdk), C (Scramble, Htslib), JavaScript (JBrowse)
– Good speed vs size vs random-access tradeoff.
● But data has changed.
– e.g. Illumina quality binning: 4 or 8 distinct values (down from 40+).
● Broader goals
– Long term archival (higher CPU for smaller files).
– More willing to consider lossy compression.
● Keep same fundamental format.
– New codecs only (CRAM 3.1); ease of adoption.
CRAM file structure
[Figure, built up over three slides:]
● File: CRAM Header, Container, Container, Container, ..., EOF marker
● Container: Header, Compression Header, Slice, Slice, ...
● Slice: Slice Header, Block, Block, Block, Block, ...
– Headers carry alignment position, record number, ...
CRAM file structure
[Figure: within each Slice, every Block holds one Data Series, e.g.]
RN    Read name
QS    Quality scores
BF    BAM flags
BC:Z  Aux. field
Note: Sequence is optionally delta vs an external or embedded reference sequence.
Via GA4GH refget
http://samtools.github.io/hts-specs/refget.html
Selection by region
[Figure: CRAM layout (Containers, Slices, one Block per Data Series: RN, QS, BF, BC:Z), illustrating selection of Slices by region.]
E.g. samtools view chr1:10,000,000-11,000,000
Selection by data series
[Figure: CRAM layout (Containers, Slices, one Block per Data Series: RN, QS, BF, BC:Z), illustrating selection of Blocks by data series.]
E.g. samtools flagstat
or cram_filter
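The layout above can be modelled in a few lines. This is a toy sketch, not the htslib data model: the class names (`Slice`, `Container`), the simplified header fields and the `select` helper are assumptions made for illustration. It shows why both kinds of selection are cheap: a region query skips whole slices using header information only, and a data-series query skips whole blocks without decoding them.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set


@dataclass
class Slice:
    # Simplified slice header: reference name and 1-based inclusive span.
    ref: str
    start: int
    end: int
    # One compressed block per data series, keyed by series name (e.g. "RN", "QS").
    blocks: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class Container:
    slices: List[Slice] = field(default_factory=list)


def select(containers: List[Container], ref: str, start: int, end: int,
           series: Optional[Set[str]] = None) -> Iterator[Dict[str, bytes]]:
    """Yield the still-compressed blocks needed to answer a query.

    Slices that do not overlap ref:start-end are skipped from their header
    alone; inside a matching slice, blocks for data series that were not
    requested are never touched.
    """
    for container in containers:
        for sl in container.slices:
            if sl.ref != ref or sl.end < start or sl.start > end:
                continue
            yield {name: data for name, data in sl.blocks.items()
                   if series is None or name in series}


# Toy file: two slices on chr1, each holding four data series.
toy = [Container([
    Slice("chr1", 1, 9_999_999,
          {"RN": b"n0", "QS": b"q0", "BF": b"f0", "BC:Z": b"t0"}),
    Slice("chr1", 10_000_000, 19_999_999,
          {"RN": b"n1", "QS": b"q1", "BF": b"f1", "BC:Z": b"t1"}),
])]

# Region query: only the second slice overlaps, all series returned.
by_region = list(select(toy, "chr1", 10_000_000, 11_000_000))
# Series query: every slice, but only the BAM-flags block of each.
flags_only = list(select(toy, "chr1", 1, 19_999_999, series={"BF"}))
```

A flagstat-like tool corresponds roughly to the second call: it needs the flags of every record but never has to decompress names or qualities.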
Transport Format: GA4GH Htsget
● Defined API for querying subsets.
– By region
  ● Fast block stitching, no need to decode data.
– By type.
  ● Can transparently “drop” unneeded data-series.
● JSON response refers to multiple https streams.
– Permits distributed data.
– Retry of individual streams if one fails.
● Union of streams is valid BAM or CRAM
– (Add new header & footer to containers)
Kelleher, Jerome, et al. "htsget: a protocol for securely streaming genomic data." Bioinformatics 35.1 (2018): 119-121.
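The "union of streams" idea can be sketched as a small client-side routine. This is a minimal illustration under stated assumptions, not a full htsget client: the ticket is assumed to be a JSON object with an `htsget.urls` list whose entries carry a `url` and optional `headers`, inline chunks are assumed to be base64 `data:` URIs, and `fetch` is a hypothetical caller-supplied function that performs the actual HTTPS request.

```python
import base64
import json
from typing import Callable, Dict, List


def assemble(ticket_json: str,
             fetch: Callable[[str, Dict[str, str]], bytes],
             retries: int = 3) -> bytes:
    """Concatenate every stream named in an htsget-style ticket, in order."""
    ticket = json.loads(ticket_json)["htsget"]
    out = bytearray()
    for entry in ticket["urls"]:
        url = entry["url"]
        if url.startswith("data:"):
            # Inline chunk (e.g. a new header or footer) sent as a base64 data URI.
            _, payload = url.split(",", 1)
            out += base64.b64decode(payload)
            continue
        headers = entry.get("headers", {})
        for attempt in range(retries):
            try:
                out += fetch(url, headers)
                break
            except OSError:
                # Only this stream is retried; data from earlier streams is kept.
                if attempt == retries - 1:
                    raise
    return bytes(out)


# Demonstration with a stand-in fetcher that fails on its first call.
calls: List[str] = []


def flaky_fetch(url: str, headers: Dict[str, str]) -> bytes:
    calls.append(url)
    if len(calls) == 1:
        raise OSError("simulated network failure")
    return b"<slice:" + headers.get("Range", "").encode() + b">"


def _inline(data: bytes) -> str:
    return "data:application/octet-stream;base64," + base64.b64encode(data).decode()


demo_ticket = json.dumps({"htsget": {"format": "CRAM", "urls": [
    {"url": _inline(b"HDR")},
    {"url": "https://example.org/data/1", "headers": {"Range": "bytes=100-199"}},
    {"url": _inline(b"EOF")},
]}})

result = assemble(demo_ticket, flaky_fetch)
```

Because each entry is an independent byte range, the server can point different entries at different hosts, and the client never needs to parse BAM or CRAM to stitch them together.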
Crypt4GH
● Format agnostic encryption (BAM, CRAM, VCF...)
– Random access
– Multiple keys in same file (users).
– Can limit keys to regions
● Fast re-encryption
– Rewrite new header only (with user’s public key).
– (Link between encrypted archive, Htsget, and users)
● Review spec:
– https://www.ga4gh.org/wp-content/uploads/crypt4gh.pdf
CRAM file structure
[Figure: CRAM layout as before; each Data Series block can use a different codec, e.g.]
RN    Bzip2
QS    Fqzcomp
BF    rANS1
BC:Z  Gzip
Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM Format Sequencing Data. PLoS ONE 8(3): e59190 (Fqzcomp)
Duda, J. Asymmetric numeral systems. arXiv:0902.0271 [cs.IT] (ANS)
Compression Basics
[Figure: letter-frequency histograms (a to z) for English text: "All Symbols" versus "Last letter = a (Order-1 entropy)". The previous letter is the Context that selects the Model.]
Compression Basics
[Figure: letter-frequency histograms (a to z) for English text: "All Symbols" versus "Last letter = q (Order-1 entropy)". The previous letter is the Context (index to array of models) that selects the Model.]
More predictable.
⇒ Better compression.
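The effect of conditioning on the previous letter can be measured directly. The sketch below computes the empirical order-0 entropy (one frequency model for all symbols) and order-1 entropy (one model per previous symbol) of a string. The function names and the sample text are illustrative assumptions, and the figures ignore the cost of storing or learning the models.

```python
import math
from collections import Counter, defaultdict


def order0_entropy(text: str) -> float:
    """Bits per symbol when every symbol shares a single frequency model."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def order1_entropy(text: str) -> float:
    """Bits per symbol when the previous symbol selects the model."""
    models = defaultdict(Counter)            # context -> counts of next symbol
    for prev, cur in zip(text, text[1:]):
        models[prev][cur] += 1
    bits = 0.0
    for counts in models.values():
        n = sum(counts.values())
        bits -= sum(c * math.log2(c / n) for c in counts.values())
    return bits / (len(text) - 1)


# Illustrative sample: after "q" the next letter is always "u".
sample = "the quick queen quietly quit the quaint quiz " * 20
h0 = order0_entropy(sample)
h1 = order1_entropy(sample)
```

On this sample `h1` is lower than `h0`: within the "q" context the distribution collapses to a single symbol, which is the "more predictable, better compression" point made on the slide.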
Quality Values
● Illumina quality values are cycle and read1 / read2 specific. (Heatmap from recent MiSeq)
[Figure: heatmaps of Quality against Cycle No., for Read 1 and Read 2.]
FQZComp
● Parameterised version of fqzcomp (2011).
● Context reset per read; adaptive:
– Previous quality values
– Approximate position within read
– “Smoothness”
● + external selector (stored into data-stream):
– Read1 / Read2 (from BAM flags)
– Tile, X, Y coord (from read name)
– Read-group
● Decoder reads selector value; no understanding of its origin needed.
– 15-35% saving over rANS1
ID (name) compression
● Pick a previous ID to compare against.
DIFF -2
● Tokenise and compare to previous tokens.
MATCH (STRING “VP2-06:112:H7LNDMCVY:1:”)
DELTA +93 (1251-1158)
MATCH (CHAR “:”)
DIGITS 6253
● Compress each token series individually.
● Also adopted by MPEG-G.
Example IDs (the last one is the ID being encoded):
VP2-06:112:H7LNDMCVY:1:1124:21694:10473
VP2-06:112:H7LNDMCVY:1:1158:23665:6370
HS25_09827:2:2208:9732:56894#49
VP2-06:112:H7LNDMCVY:1:1251:6253:36119
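The tokenise-and-compare step can be sketched as follows. This is a simplified illustration, not the CRAM 3.1 name tokeniser: the splitting rule (alternating digit and non-digit runs), the `max_delta` threshold and the function names are assumptions, and it emits one MATCH per token rather than merging adjacent matches into a single string as the slide does.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, object]
_SPLIT = re.compile(r"\d+|\D+", re.ASCII)


def _is_plain_number(tok: Optional[str]) -> bool:
    """True for ASCII digit runs that survive an int() round trip."""
    return (tok is not None and tok.isascii() and tok.isdigit()
            and (tok == "0" or tok[0] != "0"))


def encode_name(name: str, prev: str, max_delta: int = 255) -> List[Token]:
    """Describe `name` token by token, relative to the tokens of `prev`."""
    prev_toks = _SPLIT.findall(prev)
    out: List[Token] = []
    for i, tok in enumerate(_SPLIT.findall(name)):
        old = prev_toks[i] if i < len(prev_toks) else None
        if tok == old:
            out.append(("MATCH", None))
        elif (_is_plain_number(tok) and _is_plain_number(old)
              and 0 <= int(tok) - int(old) <= max_delta):
            out.append(("DELTA", int(tok) - int(old)))
        elif _is_plain_number(tok):
            out.append(("DIGITS", int(tok)))
        else:
            out.append(("STRING", tok))
    return out


def decode_name(tokens: List[Token], prev: str) -> str:
    """Rebuild the original name from its tokens and the same previous ID."""
    prev_toks = _SPLIT.findall(prev)
    parts = []
    for i, (kind, value) in enumerate(tokens):
        if kind == "MATCH":
            parts.append(prev_toks[i])
        elif kind == "DELTA":
            parts.append(str(int(prev_toks[i]) + value))
        else:                      # DIGITS and STRING carry the literal value
            parts.append(str(value))
    return "".join(parts)


prev_id = "VP2-06:112:H7LNDMCVY:1:1158:23665:6370"
this_id = "VP2-06:112:H7LNDMCVY:1:1251:6253:36119"
tokens = encode_name(this_id, prev_id)
```

Storing each token position as its own stream (all first tokens together, all second tokens together, and so on) is what lets a general-purpose entropy coder exploit the long runs of MATCH and small DELTA values.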
HiSeq 2000 (40 quals): 9827_2#49.bam
[Figure: Encode and Decode panels plotting Time (s) against Size (Mb) for BAM, CRAM-2.0, CRAM-3.0 and CRAM-3.1; one off-scale point in each panel is labelled 6500Mb.]
Normal: -4%
Archive: -33% (-34.4%)
NovaSeq (4 binned quals): NA12878-rep3_S1.bam
[Figure: Encode and Decode panels plotting Time (s) against Size (Gb) for BAM, CRAM-2.0, CRAM-3.0 and CRAM-3.1; one off-scale point in each panel is labelled 70Gb.]
Normal: -16%
Archive: -17% (-21.9%)
CRAM breakdown by data series size (BQSR modified Illumina qualities)
Data: SynDip
[Figure: chart with segments Quality, Names, Aux. tags, Sequence.]
Lossy Compression
● Thought experiment:
– Call VCF with all qualities as-is.
– Set all qualities to a fixed value (e.g. 30) and call again.
– Matching calls: discard qualities. Mismatching calls: keep qualities.
● Crumble:
– Use (hom/het) consensus quality.
– ASSUMPTION: single diploid genome
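The thought experiment translates almost directly into code. The sketch below is only an illustration of that experiment, not of Crumble itself (which, per the slide, uses consensus confidence rather than calling twice). The `toy_caller`, its `min_fraction` threshold and the read representation are hypothetical stand-ins for a real variant caller.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Read = Tuple[str, List[int]]                      # (bases, qualities), same window
Caller = Callable[[Sequence[Read]], Dict[int, str]]


def toy_caller(reads: Sequence[Read], min_fraction: float = 0.2) -> Dict[int, str]:
    """Hypothetical caller: per column, report every base holding at least
    `min_fraction` of the total quality weight."""
    calls = {}
    for col in range(len(reads[0][0])):
        weight: Dict[str, int] = {}
        for bases, quals in reads:
            weight[bases[col]] = weight.get(bases[col], 0) + quals[col]
        total = sum(weight.values())
        calls[col] = "/".join(sorted(b for b, w in weight.items()
                                     if w >= min_fraction * total))
    return calls


def columns_needing_qualities(reads: Sequence[Read], caller: Caller,
                              fixed_q: int = 30) -> Set[int]:
    """Columns whose call changes when every quality is replaced by fixed_q."""
    flattened = [(bases, [fixed_q] * len(quals)) for bases, quals in reads]
    with_quals = caller(reads)
    without_quals = caller(flattened)
    return {col for col in set(with_quals) | set(without_quals)
            if with_quals.get(col) != without_quals.get(col)}


# Column 0: all A. Column 1: confident C/T het. Column 2: one low-quality T.
reads = [
    ("ACG", [40, 40, 40]),
    ("ACG", [40, 40, 40]),
    ("ATG", [40, 40, 40]),
    ("ATT", [40, 40, 5]),
]
keep = columns_needing_qualities(reads, toy_caller)
```

Here only column 2 needs its qualities: the low-quality discrepant T is ignored when real qualities are used, but looks like a second allele once every base is given Q30. The other two columns give the same call either way, so their qualities could be discarded.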
[Figures: annotated alignment pileups, built up over several slides. Annotations:]
– Poly-A poorly aligned. Should have deletion. Most likely call is A/* het, but not confident.
– High qual C/T het.
– Deletion
– High quality, but wrong column. => looks like rare allele
– Low quality discrepant bases
– High quality, in expected ratios
– Not confident. Keep for entire poly-A, plus couple either side
– Confidently erroneous
– Confident CT het
Error correction: A step too far?
Results – file size
[Figure: bar charts of file size (GB) for BAM and CRAM 3.0:]
– MiSeq_Ecoli_DH10B_110721_PF: Orig, Crumble
– K562_cytosol_LID8465_TopHat_v2: Orig, Crumble
– Syndip (CHM1 + CHM13): Orig-50x, Crumble-50x, Orig-15x, Crumble-15x
SynDip, 50x coverage
GATK HC + Crumble (0.8)
[Figure panels: WGS, WES]
A synthetic-diploid benchmark for accurate variant-calling evaluation.
Li et al. Nature Methods 15, pp 595–597 (2018)
DNAnexus results
● https://blog.dnanexus.com/2018-07-23-breaking-down-crumble/
Plus less pronounced effect on Exome results.
What else to discard?
● Read-names
– If all alignments for template in same CRAM slice.
– Only appropriate after optical deduplication steps.
● Auxiliary tags
– Whitelist / blacklist options
– Some tags are HUGE and largely pointless. e.g. GATK “OQ” (original quality).
CRAM size breakdown
[Figure: two charts. Original file (chr1), lossless, 7.42GB: Names, Qualities, Sequence, Tags. Aux tags only: XN, NM, MD, OP, OC, XT, SA, MQ, AS, UQ, MC, PG, OQ. Dominated by OQ:Z.]

Chr1 CRAM size breakdown
[Figure: as above, with reduced segments (names), (qualities), (tags) shown against the original Names, Qualities, Sequence, Tags.]
Orig BAM: 13.02 GB
Orig CRAM: 7.42 GB
Crumble: 3.10 GB
Crumble--: 0.96 GB
Crumble-- 3.1arc: 0.85 GB

Chr1 CRAM size breakdown
[Figure: as above, with reduced segments (names), (qualities), (Sequence), (tags).]
Orig BAM: 13.02 GB
Orig CRAM: 7.42 GB
Crumble--: 0.72 GB
Crumble-- 3.1arc: 0.63 GB
Acknowledgements
● EBI: Vadim Zalunin (Java Cram)
– Markus Hsi-Yang Fritz et al. (2011); Efficient storage of high throughput DNA sequencing data using reference-based compression, Genome Research, Vol 21, Issue 5
● Sanger: Rob Davies, David Jackson (Htslib / Samtools)
● Sanger: Richard Durbin, Shane McCarthy
– James K Bonfield et al. (2019); Crumble: reference free lossy compression of sequence quality values, Bioinformatics, Vol 35, Issue 2
● Github
– https://github.com/jkbonfield/crumble
– https://github.com/jkbonfield/io_lib
– https://github.com/jkbonfield/htscodecs
– https://github.com/samtools/hts-specs
– https://www.ga4gh.org/wp-content/uploads/crypt4gh.pdf
Some costings
AWS
● S3: $0.0240/GB/month
● Glacier: $0.0045/GB/month
● G.deep: $0.0018/GB/month
● CPU: $0.0314/hour (spot pricing, m4.large)
       $0.1160/hour (normal EC2)
Pay back time
● BAM -> CRAM 3.0            AWS (cents)   AWS deep
  CPU                        1.47c         1.47c
  Storage saved (21.9Gb)     1.69c/day     0.127c/day
  Pay back time (spot)       0.9 days      11.6 days

                             Cost (cents)  GB saved  Pay back (S3)  Pay back (Glacier deep)
● CRAM3.0 → 3.1 normal       1.79          1.96      11 day         5 month
● CRAM3.0 → 3.1 small        3.02          2.55      15 day         7 month
● CRAM3.0 → 3.1 archive      5.49          2.77      25 day         11 month
https://docs.google.com/spreadsheets/d/1DmWhXUM0Os0yfjLyncl8YG2XHA5sL7g2Jd_EocM0gWU/edit#gid=1364826426
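The pay back figures follow from a one-line calculation: one-off CPU cost divided by the storage cost saved per day. The sketch below reproduces them under two assumptions that are not stated on the slide: a 31-day month (which matches the 1.69c/day and 0.127c/day figures), and that the two numeric columns of the CRAM 3.0 → 3.1 rows are the conversion cost in cents and the GB saved. The function name is illustrative.

```python
def payback_days(cpu_cost_cents: float, gb_saved: float,
                 price_per_gb_month: float, days_per_month: int = 31) -> float:
    """Days of storage savings needed to recoup a one-off conversion cost."""
    saving_cents_per_day = gb_saved * price_per_gb_month * 100 / days_per_month
    return cpu_cost_cents / saving_cents_per_day


S3 = 0.0240            # $/GB/month, from the slide
GLACIER_DEEP = 0.0018  # $/GB/month, from the slide

# BAM -> CRAM 3.0: 1.47c of spot CPU, 21.9 GB saved.
bam_to_cram_s3 = payback_days(1.47, 21.9, S3)
bam_to_cram_deep = payback_days(1.47, 21.9, GLACIER_DEEP)

# CRAM 3.0 -> 3.1 profiles: (cost in cents, GB saved).
profiles = {"normal": (1.79, 1.96), "small": (3.02, 2.55), "archive": (5.49, 2.77)}
cram31_s3 = {name: payback_days(cost, gb, S3) for name, (cost, gb) in profiles.items()}
```

Rounded to one decimal place, the first two values come out at 0.9 and 11.6 days; the whole-day parts of the `cram31_s3` values are 11, 15 and 25, in line with the S3 column of the table.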
FQZComp
● Table driven. Combine to single 16-bit context.
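A rough sketch of what "table driven, combined into a single 16-bit context" can look like is given below. It is not the CRAM 3.1 bit layout: the table contents, the four-bit field widths, the choice of inputs and the function names are all hypothetical. The point is only that lookup tables quantise each input to a few bits, the bits are packed into one 16-bit integer, and that integer indexes an array of adaptive models.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

# Hypothetical lookup tables: quantise raw values down to 4 bits each.
Q_TABLE = [min(q // 3, 15) for q in range(64)]        # quality  -> 0..15
POS_TABLE = [min(p // 16, 15) for p in range(1024)]   # position -> 0..15


def context(prev_q1: int, prev_q2: int, pos: int, selector: int) -> int:
    """Pack two previous qualities, position and selector into 16 bits."""
    ctx = Q_TABLE[min(prev_q1, 63)]                   # bits 0-3
    ctx |= Q_TABLE[min(prev_q2, 63)] << 4             # bits 4-7
    ctx |= POS_TABLE[min(pos, 1023)] << 8             # bits 8-11
    ctx |= (selector & 0xF) << 12                     # bits 12-15
    return ctx & 0xFFFF


def estimate_bits(reads: Sequence[Sequence[int]], selectors: Sequence[int],
                  nsym: int = 64) -> float:
    """Ideal coded size in bits using one adaptive count model per context."""
    models: Dict[int, List[int]] = defaultdict(lambda: [1] * nsym)
    bits = 0.0
    for quals, sel in zip(reads, selectors):
        q1 = q2 = 0                                   # context reset per read
        for pos, q in enumerate(quals):
            counts = models[context(q1, q2, pos, sel)]
            bits -= math.log2(counts[q] / sum(counts))
            counts[q] += 1                            # adapt the chosen model
            q1, q2 = q, q1
    return bits


example_ctx = context(40, 38, 150, 1)
predictable = estimate_bits([[30] * 50] * 20, [0] * 20)
```

The decoder can rebuild the same context from values it has already decoded plus the stored selector, so it stays in step with the encoder without knowing what the selector represents.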

  • 3. Why change CRAM? ● CRAM is mature. A GA4GH standard – 1.0 (2012), 2.0 (2013), 2.1 (2014), 3.0 (2014) – Java (htsjdk), C (Scramble, Htslib), JavaScript (JBrowse) – Good speed vs size vs random-access tradeoff. ● But data has changed. – e.g. Illumina 4 and 8 quality binning (down from 40+). ● Broader goals – Long term archival (higher CPU for smaller files). – More willing to consider lossy compression. ● Keep same fundamental format. – New codecs only (CRAM 3.1); ease of adoption.
  • 4. CRAM file structure [diagram]: CRAM Header, then a series of Containers, then an EOF marker.
  • 5. CRAM file structure [diagram]: each Container holds a Compression Header followed by Slices.
  • 6. CRAM file structure [diagram]: each Slice holds a Slice Header followed by Blocks. Annotation: Alignment position, record number, ...
  • 7. CRAM file structure [diagram]: each Block holds one Data Series, e.g. RN (read name), QS (quality scores), BF (BAM flags), BC:Z (aux. field). Note: Sequence is optionally delta vs an external or embedded reference sequence. Via GA4GH refget http://samtools.github.io/hts-specs/refget.html
  • 8. Selection by region [diagram: Containers, Slices and per-Data-Series Blocks (RN, QS, BF, BC:Z)]. E.g. samtools view chr1:10,000,000-11,000,000
  • 9. Selection by data series [diagram: Containers, Slices and per-Data-Series Blocks (RN, QS, BF, BC:Z)]. E.g. samtools flagstat or cram_filter
  • 10. Transport Format: GA4GH Htsget ● Defined API for querying subsets. – By region ● Fast block stitching, no need to decode data. – By type. ● Can transparently “drop” unneeded data-series. ● JSON response refers to multiple HTTPS streams. – Permits distributed data. – Retry of individual streams if one fails. ● Union of streams is valid BAM or CRAM – (Add new header & footer to containers) Kelleher, Jerome, et al. "htsget: a protocol for securely streaming genomic data." Bioinformatics 35.1 (2018): 119-121.
  • 11. Crypt4GH ● Format agnostic encryption (BAM, CRAM, VCF...) – Random access – Multiple keys in same file (users). – Can limit keys to regions ● Fast re-encryption – Rewrite new header only (with user’s public key). – (Link between encrypted archive, Htsget, and users) ● Review spec: – https://www.ga4gh.org/wp-content/uploads/crypt4gh.pdf
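The "union of streams" idea on the htsget slide can be sketched roughly as follows. This is a minimal illustration, assuming a ticket shaped like the htsget JSON response (an `htsget` object with a `urls` list whose entries carry a `url` and optional `headers`); the function names and the example ticket are hypothetical, and a real client would add retries, authentication and error handling.

```python
import base64
import urllib.parse
import urllib.request


def fetch_block(entry):
    """Return the bytes for one entry of the ticket's "urls" list."""
    url = entry["url"]
    if url.startswith("data:"):
        # Inline data, e.g. a header or EOF block embedded in the ticket.
        meta, _, payload = url.partition(",")
        if meta.endswith(";base64"):
            return base64.b64decode(payload)
        return urllib.parse.unquote_to_bytes(payload)
    # Remote block: each stream can be fetched (and retried) on its own.
    request = urllib.request.Request(url, headers=entry.get("headers", {}))
    with urllib.request.urlopen(request) as response:
        return response.read()


def fetch_htsget_ticket(ticket):
    """Concatenate all blocks, in ticket order, into one byte string."""
    return b"".join(fetch_block(entry) for entry in ticket["htsget"]["urls"])


def inline(data):
    """Build a base64 data: URI (helper for the hypothetical ticket below)."""
    return "data:application/octet-stream;base64," + base64.b64encode(data).decode()


# Hypothetical ticket using only inline data, so it runs without a server.
example_ticket = {
    "htsget": {
        "format": "CRAM",
        "urls": [{"url": inline(b"HEADER")}, {"url": inline(b"BODY")}],
    }
}
assembled = fetch_htsget_ticket(example_ticket)
```

Because the client only concatenates bytes, it never needs to decode the BAM or CRAM payload itself.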
  • 12. CRAM file structure [diagram: Containers, Slices and per-Data-Series Blocks (RN, QS, BF, BC:Z), with a codec per Block: Bzip2, Fqzcomp, rANS1, Gzip]. References: Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM Format Sequencing Data. PLoS ONE 8(3): e59190 (Fqzcomp); Duda, J. Asymmetric numeral systems. arXiv:0902.0271 [cs.IT] (ANS)
  • 13. Compression Basics [two histograms of letter frequencies a-z in English text: "All Symbols" and "Last letter = a" (order-1 entropy)]. Labels: Model, Context.
  • 14. Compression Basics [two histograms of letter frequencies a-z in English text: "All Symbols" and "Last letter = q" (order-1 entropy)]. Labels: Model, Context (index to array of models). More predictable ⇒ better compression.
  • 15. Quality Values ● Illumina quality values are cycle and read1 / read2 specific. (Heatmap from recent MiSeq) [heatmaps: Quality vs Cycle No., for Read 1 and Read 2]
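The effect shown on the two "Compression Basics" slides can be reproduced with a small sketch: compare the entropy of a text when all symbols share one model against the entropy when the previous symbol (the context) picks the model. The function names and the sample text are illustrative assumptions, not taken from the talk.

```python
import math
from collections import Counter, defaultdict


def order0_entropy(text):
    """Bits per symbol when every symbol shares one frequency table."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def order1_entropy(text):
    """Bits per symbol when the previous symbol selects the frequency table."""
    models = defaultdict(Counter)  # context (previous symbol) -> symbol counts
    for prev, cur in zip(text, text[1:]):
        models[prev][cur] += 1
    bits = 0.0
    for counts in models.values():
        ctx_total = sum(counts.values())
        for n in counts.values():
            bits -= n * math.log2(n / ctx_total)
    return bits / (len(text) - 1)


# Made-up sample: after "q" the next letter is always "u", so that
# context's model is highly predictable.
sample = "the quick queen quietly quit the quiz " * 20
h0 = order0_entropy(sample)
h1 = order1_entropy(sample)
```

On text like this `h1` comes out well below `h0`, which is the "more predictable, better compression" point: an entropy coder fed per-context models needs fewer bits per symbol.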
  • 16. FQZComp ● Parameterised version of fqzcomp (2011). ● Context reset per read; adaptive: – Previous quality values – Approximate position within read – “Smoothness” ● + external selector (stored into data-stream): – Read1 / Read2 (from BAM flags) – Tile, X, Y coord (from read name) – Read-group ● Decoder just reads the selector value; it needs no understanding of what the value means. ● 15-35% saving over rANS1
  • 17. ID (name) compression ● Example names: VP2-06:112:H7LNDMCVY:1:1124:21694:10473; VP2-06:112:H7LNDMCVY:1:1158:23665:6370; HS25_09827:2:2208:9732:56894#49; VP2-06:112:H7LNDMCVY:1:1251:6253:36119 ● Pick a previous ID to compare against: DIFF -2 ● Tokenise and compare to previous tokens: MATCH (STRING “VP2-06:112:H7LNDMCVY:1:”), DELTA +93 (1251-1158), MATCH (CHAR “:”), DIGITS 6253 ● Compress each token series individually. ● Also adopted by MPEG-G.
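The tokenise-and-compare step can be sketched as below. This is a simplified illustration, not the CRAM 3.1 name tokeniser: the digit/non-digit split, the 0-255 delta limit and the function names are assumptions, so token boundaries differ from the slide (which treats the shared prefix as a single string token).

```python
import re


def tokenise(name):
    """Split a read name into alternating runs of digits and non-digits."""
    return re.findall(r"\d+|\D+", name)


def encode_name(name, prev_name):
    """Describe each token of `name` relative to the same token of `prev_name`."""
    prev = tokenise(prev_name)
    ops = []
    for i, tok in enumerate(tokenise(name)):
        old = prev[i] if i < len(prev) else None
        plain_number = tok.isdigit() and str(int(tok)) == tok  # no leading zeros
        if tok == old:
            ops.append(("MATCH",))
        elif (plain_number and old is not None and old.isdigit()
              and 0 < int(tok) - int(old) < 256):
            ops.append(("DELTA", int(tok) - int(old)))
        elif plain_number:
            ops.append(("DIGITS", int(tok)))
        else:
            ops.append(("STRING", tok))
    return ops


def decode_name(ops, prev_name):
    """Invert encode_name using the ops and the previous name."""
    prev = tokenise(prev_name)
    out = []
    for i, op in enumerate(ops):
        if op[0] == "MATCH":
            out.append(prev[i])
        elif op[0] == "DELTA":
            out.append(str(int(prev[i]) + op[1]))
        else:  # DIGITS and STRING carry the token value itself
            out.append(str(op[1]))
    return "".join(out)


# Names from the slide: the fourth name is compared with the one two back.
prev_name = "VP2-06:112:H7LNDMCVY:1:1158:23665:6370"
name = "VP2-06:112:H7LNDMCVY:1:1251:6253:36119"
ops = encode_name(name, prev_name)
```

In a real codec the operations at each token position would be gathered into separate streams and compressed individually, so the long runs of MATCH cost almost nothing.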
  • 18. HiSeq 2000 (40 quals) [plots: Size (Mb) vs Time (s) for Encode and Decode of 9827_2#49.bam; series: CRAM-3.1, CRAM-3.0, CRAM-2.0, BAM; annotation: 6500Mb]. Normal: -4%; Archive: -33% (-34.4%)
  • 19. NovaSeq (4 binned quals) [plots: Size (Gb) vs Time (s) for Encode and Decode of NA12878-rep3_S1.bam; series: CRAM-3.1, CRAM-3.0, CRAM-2.0, BAM; annotation: 70Gb]. Normal: -16%; Archive: -17% (-21.9%)
  • 20. Lossy Compression [chart: CRAM breakdown by data series size (BQSR modified Illumina qualities); data: SynDip; series: Quality, Names, Aux. tags, Sequence] ● Thought experiment: – Call VCF with all qualities as-is. – Set all qualities to a fixed value (e.g. 30) and call again. – Matching calls: discard qualities. Mismatching calls: keep qualities. ● Crumble: – Use (hom/het) consensus quality. – ASSUMPTION: single diploid genome
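The thought experiment amounts to comparing two call sets and keeping qualities only where they disagree. The sketch below shows that comparison alone; the function name, the genotype strings and the positions are hypothetical, and this is not how Crumble itself decides (the slide says it uses consensus quality rather than calling twice).

```python
def positions_needing_qualities(calls_with_quals, calls_fixed_quals):
    """Return the sites where the two call sets disagree.

    Each argument maps (chromosome, position) -> genotype string. A site
    missing from one set counts as a disagreement.
    """
    sites = set(calls_with_quals) | set(calls_fixed_quals)
    return sorted(
        site for site in sites
        if calls_with_quals.get(site) != calls_fixed_quals.get(site)
    )


# Made-up calls: one set from the original qualities, one from all-Q30 data.
original = {("chr1", 100): "C/T", ("chr1", 250): "A/A", ("chr1", 300): "A/*"}
fixed_q30 = {("chr1", 100): "C/T", ("chr1", 250): "A/A", ("chr1", 420): "G/T"}

# Qualities matter only at these sites; elsewhere they could be discarded.
keep = positions_needing_qualities(original, fixed_q30)
```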
  • 21. Poly-A poorly aligned. Should have deletion. Most likely call is A/* het, but not confident. High qual C/T het. Deletion
  • 22. High quality, but wrong column. => looks like rare allele Low quality discrepant bases High quality, in expected ratios
  • 23. Not confident. Keep for entire poly-A, plus couple either side Confidently erroneous Confident CT het
  • 24. Confident CT het Error correction: A step too far? Not confident. Keep for entire poly-A, plus couple either side
  • 25. Results – file size [bar charts, GB, BAM vs CRAM 3.0, Orig vs Crumble: MiSeq_Ecoli_DH10B_110721_PF; K562_cytosol_LID8465_TopHat_v2; Syndip (CHM1 + CHM13) at 50x and 15x]. Labels: WGS, WES
  • 26. SynDip, 50x coverage GATK HC + Crumble (0.8) A synthetic-diploid benchmark for accurate variant-calling evaluation. Li et al., Nature Methods 15, pp 595–597 (2018)
  • 27. SynDip, 50x coverage GATK HC + Crumble (0.8) A synthetic-diploid benchmark for accurate variant-calling evaluation. Li et al., Nature Methods 15, pp 595–597 (2018)
  • 30. What else to discard? ● Read-names – If all alignments for template in same CRAM slice. – Only appropriate after optical deduplication steps. ● Auxiliary tags – Whitelist / blacklist options – Some tags are HUGE and largely pointless. e.g. GATK “OQ” (original quality).
  • 31. CRAM size breakdown [chart]: Original file (chr1). Lossless. 7.42GB. Series: Names, Qualities, Sequence, Tags. Aux tags only: XN NM MD OP OC XT SA MQ AS UQ MC PG OQ. Dominated by OQ:Z
  • 32. Chr1 CRAM size breakdown [chart]: series: Names (names), Qualities (qualities), Sequence, Tags (tags). Orig BAM: 13.02 GB; Orig CRAM: 7.42 GB; Crumble: 3.10 GB; Crumble--: 0.96 GB; Crumble-- 3.1arc: 0.85 GB
  • 33. Chr1 CRAM size breakdown [chart]: series: Names (names), Qualities (qualities) (qualities), Sequence (Sequence), Tags (tags). Orig BAM: 13.02 GB; Orig CRAM: 7.42 GB; Crumble--: 0.72 GB; Crumble-- 3.1arc: 0.63 GB
  • 34. Acknowledgements ● EBI: Vadim Zalunin (Java Cram) – Markus Hsi-Yang Fritz et al. (2011); Efficient storage of high throughput DNA sequencing data using reference-based compression, Genome Research, Vol 21, Issue 5 ● Sanger: Rob Davies, David Jackson (Htslib / Samtools) ● Sanger: Richard Durbin, Shane McCarthy – James K Bonfield et al. (2019); Crumble: reference free lossy compression of sequence quality values, Bioinformatics, Vol 35, Issue 2 ● Github – https://github.com/jkbonfield/crumble – https://github.com/jkbonfield/io_lib – https://github.com/jkbonfield/htscodecs – https://github.com/samtools/hts-specs – https://www.ga4gh.org/wp-content/uploads/crypt4gh.pdf
  • 35. Some costings AWS ● S3: $0.0240/GB/month ● Glacier: $0.0045/GB/month ● G.deep: $0.0018/GB/month ● CPU: $0.0314/hour (spot pricing, m4.large) $0.1160/hour (normal EC2)
  • 36. Pay back time, AWS (cents) ● BAM -> CRAM 3.0 (AWS S3 / AWS Glacier Deep): CPU 1.47c / 1.47c; storage saved (21.9Gb) 1.69c/day / 0.127c/day; pay back time (spot) 0.9 days / 11.6 days ● CRAM 3.0 → 3.1 (CPU cents; GB saved; pay back on S3; pay back on Glacier Deep): normal 1.79; 1.96; 11 days; 5 months ● small 3.02; 2.55; 15 days; 7 months ● archive 5.49; 2.77; 25 days; 11 months https://docs.google.com/spreadsheets/d/1DmWhXUM0Os0yfjLyncl8YG2XHA5sL7g2Jd_EocM0gWU/edit#gid=1364826426
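The pay back figures follow from dividing the one-off CPU cost by the daily storage saving. The sketch below reworks the BAM to CRAM 3.0 row from the prices on the costings slide; the 31-day month is an assumption chosen because it reproduces the per-day figures shown, and the function name is illustrative.

```python
S3_PER_GB_MONTH = 0.0240            # USD, from the costings slide
GLACIER_DEEP_PER_GB_MONTH = 0.0018  # USD, from the costings slide
DAYS_PER_MONTH = 31                 # assumption; matches 1.69c/day and 0.127c/day


def payback_days(cpu_cost_cents, gb_saved, price_per_gb_month):
    """Days of storage savings needed to recoup a one-off CPU cost."""
    saving_cents_per_day = gb_saved * price_per_gb_month * 100 / DAYS_PER_MONTH
    return cpu_cost_cents / saving_cents_per_day


# BAM -> CRAM 3.0: 1.47 cents of spot CPU, 21.9 GB saved.
bam_to_cram_s3 = payback_days(1.47, 21.9, S3_PER_GB_MONTH)
bam_to_cram_deep = payback_days(1.47, 21.9, GLACIER_DEEP_PER_GB_MONTH)
```

The same function applied to the CRAM 3.0 → 3.1 rows gives values close to the slide's rounded figures, for example about 5 months for the "normal" profile on Glacier Deep.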
  • 37. FQZComp ● Table driven. Combine to single 16-bit context.
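"Table driven" and "single 16-bit context" can be pictured as small lookup tables whose outputs are shifted and OR-ed into one integer, which then indexes an array of models. The bit layout, table contents and names below are illustrative assumptions only, not the layout of the real codec, and the “smoothness” term from the earlier FQZComp slide is omitted for brevity.

```python
# Hypothetical bit layout for the 16-bit context:
#   bits 0-5   previous quality value (via QUAL_TABLE)
#   bits 6-9   larger of the two qualities before that, coarsely binned
#   bits 10-13 approximate position within the read (via POS_TABLE)
#   bits 14-15 external selector, e.g. read1 / read2
QUAL_TABLE = [min(q, 63) for q in range(256)]          # quality -> 6-bit code
COARSE_TABLE = [min(q // 4, 15) for q in range(256)]   # quality -> 4-bit code
POS_TABLE = [min(p // 16, 15) for p in range(1024)]    # cycle   -> 4-bit code


def make_context(prev_q, older_q1, older_q2, pos, selector):
    """Combine the table outputs into a single 16-bit model index."""
    ctx = QUAL_TABLE[prev_q]
    ctx |= COARSE_TABLE[max(older_q1, older_q2)] << 6
    ctx |= POS_TABLE[min(pos, len(POS_TABLE) - 1)] << 10
    ctx |= (selector & 0x3) << 14
    return ctx & 0xFFFF


# The context selects one of up to 65536 adaptive models.
ctx = make_context(prev_q=37, older_q1=37, older_q2=25, pos=40, selector=1)
```

Changing the tables changes how finely each input is binned without touching the coding loop, which is one reason a table-driven design is easy to parameterise.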