1. Next Generation Sequencing
File Formats.
Pierre Lindenbaum
@yokofakun
pierre.lindenbaum@univ-nantes.fr
http://plindenbaum.blogspot.com
https://github.com/lindenb/courses
Institut du Thorax. Nantes. France
September 19, 2014
2. You don't need to have a deep knowledge of those formats.
(Unless you're doing NGS)
3. Understand how people have solved their BIG data problems.
4. Why sequencing?
7. Well, that's a little more complicated ...
8. FASTQ
9. FASTQ
FASTQ: text-based format for storing both a DNA sequence and
its corresponding quality scores
10. FASTQ
FASTQ for single end
11. FASTQ
FASTQ for paired end
13. FASTQ name
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
Col      Brief description
EAS139   the unique instrument name
136      the run id
FC706VJ  the flowcell id
2        flowcell lane
2104     tile number within the flowcell lane
15343    'x'-coordinate of the cluster within the tile
197393   'y'-coordinate of the cluster within the tile
1        the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y        Y if the read fails filter (read is bad), N otherwise
18       0 when none of the control bits are on, otherwise it is an even number
ATCACG   index sequence
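The field table above can be turned into a small parser. This is a minimal sketch, assuming the header always has the two space-separated parts shown on the slide; the class and function names are illustrative, not part of any library.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    fails_filter: bool
    control_bits: int
    index_sequence: str


def parse_read_name(header: str) -> IlluminaReadName:
    """Split an Illumina-style FASTQ name line into the fields listed above."""
    # First part: where the read comes from; second part: pair/filter/index info.
    location, description = header.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    member, filtered, control, index = description.split(":")
    return IlluminaReadName(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(member), filtered == "Y", int(control), index,
    )


name = parse_read_name("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(name.flowcell_id, name.pair_member, name.fails_filter)
# FC706VJ 1 True
```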
15. FASTQ Quality
16. FASTQ Quality
A quality value Q is an integer mapping of p (i.e., the probability
that the corresponding base call is incorrect).
$Q_{\text{sanger}} = -10 \log_{10} p$
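A short worked example of the Phred mapping, Q = -10 log10 p, and its inverse. The function names are illustrative only.

```python
import math


def phred_from_error(p: float) -> float:
    """Q = -10 * log10(p), where p is the probability that the base call is wrong."""
    return -10.0 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Inverse mapping: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# One error in 10, 100 and 1000 base calls.
for p in (0.1, 0.01, 0.001):
    print(p, round(phred_from_error(p)))
```

Each tenfold drop in the error probability adds 10 to Q, so Q = 30 means one expected error in 1000 base calls.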
17. FASTQ Quality
Since a human-readable format is desired for FASTQ (as for SAM), 33 is
added to the calculated quality to make it a printable ASCII character
ranging from ! (Q = 0) to ~ (Q = 93).
$\text{ASCII code} = Q_{\text{sanger}} + 33 = -10 \log_{10} p + 33$
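The +33 offset can be sketched as a pair of helpers that convert between a quality string and its integer scores. The names and the sample string are made up for illustration.

```python
from typing import List


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Turn a quality string into Phred scores by subtracting the ASCII offset."""
    return [ord(c) - offset for c in qual]


def encode_quality(scores: List[int], offset: int = 33) -> str:
    """Turn Phred scores back into printable characters by adding the offset."""
    return "".join(chr(q + offset) for q in scores)


print(decode_quality("!5I~"))
# [0, 20, 40, 93]
print(encode_quality([0, 20, 40, 93]))
# !5I~
```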
19. SAM
20. SAM Format
SAM (Sequence Alignment/Map) format is a generic format for
storing large nucleotide sequence alignments
21. SAM Format
Is flexible enough to store all the alignment information
generated by various alignment programs;
Is simple enough to be easily generated by alignment
programs or converted from existing alignment formats;
Is compact in file size;
Allows most operations on the alignment to work on a
stream without loading the whole alignment into memory;
Allows the file to be indexed by genomic position to efficiently
retrieve all reads aligning to a locus.
24. SAM Format
Structure
+ HEADER
  - version
  - program parameters
  + GENOME
    - chrom1 size
    - chrom2 size
    - chrom3 size
    - (...)
  + GROUPS
    - group1 : sample1, lane 4
    - group2 : sample2, lane 1
+ BODY
  - READ1 - group1
  - READ2 - group1
  - READ3 - group1
  - READ4 - group2
  - (...)
26. SAM Header Section
28. SAM Alignment Section
32. Specifications
Record Column
Col  Field  Type    Brief description
1    QNAME  String  Query template NAME
2    FLAG   Int     bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality
6    CIGAR  String  CIGAR string
7    RNEXT  String  Ref. name of the mate/next read
8    PNEXT  Int     Position of the mate/next read
9    TLEN   Int     observed Template LENgth
10   SEQ    String  segment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33
12   META           metadata
34. Specifications
Record Column
Col  Field  Value
1    QNAME  IL31_4368:1:42:12530:7509
2    FLAG   137
3    RNAME  chr1
4    POS    10
5    MAPQ   30
6    CIGAR  54M
7    RNEXT  =
8    PNEXT  100
9    TLEN   90
10   SEQ    TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGATC
11   QUAL   GGGGGGGFEGGGGCFGGGGGEGGFGEGGFGFGGFGFEGFCFFBECCBDACB@?B
12   META   XT:A:R NM:i:3 SM:i:0 AM:i:0 X0:i:11 X1:i:0 XM:i:3 XO:i:0 XG:i:0
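A SAM alignment line is tab-separated, so the record above can be split into the eleven mandatory columns plus the trailing metadata. A minimal sketch: the function name is illustrative, only two of the tags are kept for brevity, and the underscore in the read name is an assumption (the slide text shows a space there).

```python
SAM_COLUMNS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_COLUMNS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> dict:
    """Map the 11 mandatory columns to their names; keep the rest as raw tags."""
    fields = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_COLUMNS, fields):
        record[name] = int(value) if name in INT_COLUMNS else value
    record["TAGS"] = fields[len(SAM_COLUMNS):]
    return record


line = "\t".join([
    "IL31_4368:1:42:12530:7509", "137", "chr1", "10", "30", "54M", "=", "100", "90",
    "TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGATC",
    "GGGGGGGFEGGGGCFGGGGGEGGFGEGGFGFGGFGFEGFCFFBECCBDACB@?B",
    "XT:A:R", "NM:i:3",
])
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], rec["TAGS"])
# chr1 10 54M ['XT:A:R', 'NM:i:3']
```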
35. SAM FLAGS
0x1    read paired.
0x2    read mapped in proper pair.
0x4    read unmapped.
0x8    mate unmapped.
0x10   read reverse strand.
0x20   mate reverse strand.
0x40   first in pair.
0x80   second in pair.
0x100  not primary alignment.
0x200  read fails platform/vendor quality checks.
0x400  read is PCR or optical duplicate.
0x800  supplementary alignment.
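The FLAG column is a bit field: each property in the list above owns one bit, from 0x1 (read paired) to 0x800 (supplementary alignment). A minimal sketch that decodes a FLAG value into those labels; the function name is illustrative.

```python
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> list:
    """Return the label of every bit that is set in the FLAG value."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


# 137 = 0x1 + 0x8 + 0x80
print(decode_flag(137))
# ['read paired', 'mate unmapped', 'second in pair']
```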
37. SAM FLAGS
38. SAM FLAGS
Read Paired
39. SAM FLAGS
Read mapped in proper pair
40. Read mapped in proper pair
41. SAM FLAGS
Read unmapped
42. SAM FLAGS
Mate unmapped
43. SAM FLAGS
Read reverse strand
44. SAM FLAGS
Mate reverse strand
45. SAM FLAGS
First in pair
46. SAM FLAGS
Second in pair
47. SAM FLAGS
not primary alignment
48. SAM FLAGS
read fails platform/vendor quality checks
49. SAM FLAGS
read is PCR or optical duplicate
50. SAM FLAGS
supplementary alignment
51. SAM CIGAR
The CIGAR string is a sequence of base lengths and their
associated operations. It indicates which bases align with the
reference (as a match or a mismatch), which are deleted from the
reference, and which are insertions that are not in the reference.
52. SAM Cigar
Op BAM Description
M 0 alignment match (can be a sequence match or mismatch)
I 1 insertion to the reference
D 2 deletion from the reference
N 3 skipped region from the reference
S 4 soft clipping (clipped sequences present in SEQ)
H 5 hard clipping (clipped sequences NOT present in SEQ)
P 6 padding (silent deletion from padded reference)
= 7 sequence match
X 8 sequence mismatch
53. SAM Cigar
http://genome.sph.umich.edu/wiki/SAM
RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
Read: ACTAGAATGGCT
Aligning these two:
RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T
With the alignment above, you get:
POS: 5
CIGAR: 3M1I3M1D5M
or
CIGAR: 3=1I3=1D2=1X2=
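The two CIGAR strings above can be checked mechanically: M, I, S, = and X consume read bases, while M, D, N, = and X consume reference bases. A minimal sketch with illustrative function names:

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Reject strings containing anything the pattern did not account for.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def query_length(ops: list) -> int:
    """Number of read bases covered by the operations."""
    return sum(n for n, op in ops if op in CONSUMES_QUERY)


def reference_length(ops: list) -> int:
    """Number of reference bases covered by the operations."""
    return sum(n for n, op in ops if op in CONSUMES_REFERENCE)


ops = parse_cigar("3M1I3M1D5M")
print(query_length(ops), reference_length(ops))
# 12 12
# With POS = 5, the last aligned reference position is 5 + 12 - 1 = 16.
print(5 + reference_length(ops) - 1)
# 16
```

The read ACTAGAATGGCT has 12 bases, which matches the query length, and the alignment ends on reference position 16, as in the picture.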
54. SAM Cigar
Soft Clip
55. SAM Cigar
Hard Clip
57. TAGs are optional fields on a SAM/BAM alignment. A TAG is composed of
a two-character TAG key, the type of the value, and the value:
[A-Za-z][A-Za-z0-9]:[AifZH]:.*
The types A, i, f, Z and H indicate the type of value
stored in the tag.
Type  Description
A     character
i     signed 32-bit integer
f     single-precision float
Z     string
H     hex string
59. XT:A:U - user-defined tag called XT. It holds a character.
The value associated with this tag is 'U'.
NM:i:2 - predefined tag NM means: edit distance to the
reference (number of changes necessary to make this equal
the reference, excluding clipping)
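The KEY:TYPE:VALUE layout of a tag can be parsed with one regular expression and a small type table. A minimal sketch covering only the five types A, i, f, Z and H; the names are illustrative, and the key pattern assumes a letter followed by a letter or digit.

```python
import re

TAG_RE = re.compile(r"^([A-Za-z][A-Za-z0-9]):([AifZH]):(.*)$")

# How each type code is converted to a Python value.
CONVERTERS = {
    "A": str,            # single character
    "i": int,            # signed integer
    "f": float,          # floating point number
    "Z": str,            # string
    "H": bytes.fromhex,  # hex string, decoded to bytes
}


def parse_tag(text: str) -> tuple:
    """Return (key, typed value) for one TAG such as 'NM:i:2'."""
    match = TAG_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid tag: {text!r}")
    key, type_code, raw = match.groups()
    return key, CONVERTERS[type_code](raw)


print(parse_tag("XT:A:U"), parse_tag("NM:i:2"))
# ('XT', 'U') ('NM', 2)
```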
64. The SAM/BAM file format (Sequence Alignment/Map) comes in a
plain text format (SAM), and a compressed binary format (BAM).
The latter uses a modified form of gzip compression called BGZF
(Blocked GNU Zip Format), which can be applied to any file
format to provide compression with efficient random access.
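The idea behind blocked compression can be illustrated with plain gzip members. This is a simplified sketch, not real BGZF: actual BGZF blocks also record their own size in a gzip extra field, and the block contents below are made-up data. Each block is compressed on its own, so a reader that knows a block's byte offset can decompress it without touching the blocks before it.

```python
import gzip

# Two independent "blocks" of made-up, tab-separated data.
blocks = [b"chr1\t100\tA\n" * 3, b"chr2\t200\tC\n" * 3]

# Compress each block separately and concatenate the gzip members.
members = [gzip.compress(block) for block in blocks]
archive = b"".join(members)

# The concatenation is still a valid gzip stream as a whole.
whole = gzip.decompress(archive)

# Random access: jump straight to the byte offset of the second block.
offset = len(members[0])
second = gzip.decompress(archive[offset:])

print(whole == b"".join(blocks), second == blocks[1])
# True True
```

An index only needs to remember, for each genomic region, the offset of the block that holds it.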
67. BAM INDEX
68. CRAM
69. VCF
Variant Call Format
71. VCF is a text file format (most likely stored in a compressed
manner). It contains meta-information lines, a header line, and
then data lines, each containing information about a position in the
genome.
72. VCF
Example
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM  POS      ID         REF   ALT      QUAL  FILTER  INFO                               FORMAT       NA00001         NA00002         NA00003
20      14370    rs6054257  G     A        29    PASS    NS=3;DP=14;AF=0.5;DB;H2            GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      17330    .          T     A        3     q10     NS=3;DP=11;AF=0.017                GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
20      1110696  rs6040355  A     G,T      67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB  GT:GQ:DP:HQ  1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
20      1230237  .          T     .        47    PASS    NS=3;DP=13;AA=T                    GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
20      1234567  microsat1  GTCT  G,GTACT  50    PASS    NS=3;DP=9;AA=G                     GT:GQ:DP     0/1:35:4        0/2:17:2        1/1:40:3
73. VCF
Column
CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
SAMPLE-1
SAMPLE-2
SAMPLE-3
...
(...)
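The column list above maps directly onto a tab-separated data line: eight fixed columns, then FORMAT, then one column per sample. A minimal sketch, with an illustrative function name, using the microsatellite line of the example file:

```python
FIXED_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line: str, sample_names: list) -> dict:
    """Split one tab-separated VCF data line into fixed columns and samples."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # "." means no alternate allele; several alleles are comma-separated.
    record["ALT"] = [] if record["ALT"] == "." else record["ALT"].split(",")
    record["FORMAT"] = fields[8] if len(fields) > 8 else None
    record["SAMPLES"] = dict(zip(sample_names, fields[9:]))
    return record


line = "\t".join([
    "20", "1234567", "microsat1", "GTCT", "G,GTACT", "50", "PASS",
    "NS=3;DP=9;AA=G", "GT:GQ:DP", "0/1:35:4", "0/2:17:2", "1/1:40:3",
])
rec = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
print(rec["POS"], rec["ALT"], rec["SAMPLES"]["NA00002"])
# 1234567 ['G', 'GTACT'] 0/2:17:2
```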
75. INFO fields should be described as follows:
##INFO=<ID=ID,Number=number,Type=type,Description="description">
(...)
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
(...)
#CHROM  POS    ID         REF  ALT  QUAL  FILTER  INFO                     FORMAT       NA00001         NA00002         NA00003
20      14370  rs6054257  G    A    29    PASS    NS=3;DP=14;AF=0.5;DB;H2  GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      17330  .          T    A    3     q10     NS=3;DP=11;AF=0.017      GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
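The INFO column is a semicolon-separated list of KEY=VALUE pairs, where keys declared with Type=Flag appear without a value. A minimal sketch: the type table is hand-copied from the ##INFO lines of the example file instead of being read from the header, and the names are illustrative.

```python
# Value types taken from the ##INFO header lines of the example file.
INFO_TYPES = {"NS": int, "DP": int, "AF": float, "AA": str}


def parse_info(info: str) -> dict:
    """Turn an INFO string such as 'NS=3;DP=14;DB' into a dictionary."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        if "=" not in item:
            # A key without a value is a flag (e.g. DB, H2).
            result[item] = True
            continue
        key, raw = item.split("=", 1)
        convert = INFO_TYPES.get(key, str)
        values = [convert(v) for v in raw.split(",")]
        result[key] = values if len(values) > 1 else values[0]
    return result


print(parse_info("NS=3;DP=14;AF=0.5;DB;H2"))
# {'NS': 3, 'DP': 14, 'AF': 0.5, 'DB': True, 'H2': True}
```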
76. VCF
FILTERs
FILTERs that have been applied to the data should be described as
follows:
##FILTER=<ID=ID,Description="description">
(...)
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50 percent of samples have data">
(...)
#CHROM  POS      ID         REF  ALT  QUAL  FILTER  INFO                               FORMAT       NA00001         NA00002         NA00003
20      14370    rs6054257  G    A    29    PASS    NS=3;DP=14;AF=0.5;DB;H2            GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      17330    .          T    A    3     q10     NS=3;DP=11;AF=0.017                GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
20      1110696  rs6040355  A    G,T  67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB  GT:GQ:DP:HQ  1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
80. Genotype fields specified in the FORMAT field should be described
as follows:
##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
(...)
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
(...)
#CHROM  POS      ID         REF   ALT      QUAL  FILTER  INFO                               FORMAT       NA00001         NA00002         NA00003
20      14370    rs6054257  G     A        29    PASS    NS=3;DP=14;AF=0.5;DB;H2            GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      17330    .          T     A        3     q10     NS=3;DP=11;AF=0.017                GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
20      1110696  rs6040355  A     G,T      67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB  GT:GQ:DP:HQ  1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
20      1230237  .          T     .        47    PASS    NS=3;DP=13;AA=T                    GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
20      1234567  microsat1  GTCT  G,GTACT  50    PASS    NS=3;DP=9;AA=G                     GT:GQ:DP     0/1:35:4        0/2:17:2        1/1:40:3
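Each sample column is read against the FORMAT column: both are colon-separated, and the keys of FORMAT name the values of the sample. A minimal sketch with illustrative function names; in GT, "|" is taken to mean a phased genotype and "/" an unphased one.

```python
def parse_sample(format_field: str, sample_field: str) -> dict:
    """Pair the FORMAT keys with the values of one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Trailing values may be omitted in a sample; zip() simply stops early.
    return dict(zip(keys, values))


def parse_genotype(gt: str) -> tuple:
    """Return (allele indexes, phased?) for a GT value such as '1|0'."""
    phased = "|" in gt
    alleles = [None if a == "." else int(a)
               for a in gt.replace("|", "/").split("/")]
    return alleles, phased


sample = parse_sample("GT:GQ:DP:HQ", "1|0:48:8:51,51")
print(sample)
# {'GT': '1|0', 'GQ': '48', 'DP': '8', 'HQ': '51,51'}
print(parse_genotype(sample["GT"]))
# ([1, 0], True)
```

Allele index 0 is REF, and 1, 2, ... index into the comma-separated ALT list, so 0/2 on the microsatellite line means one GTCT allele and one GTACT allele.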
81. Tabix
82. Binning
83. Tabix INDEX
84. Building the TABIX index
$ bgzip -f file.vcf
$ tabix -p vcf file.vcf.gz
85. Querying the TABIX index
$ tabix file.vcf.gz chr3:1235456778
86. API
87. Reading SAM with the samtools C library
#include <stdlib.h>
#include <stdio.h>
#include "bam.h"
#include "sam.h"
/* Count the mapped reads of a BAM file (legacy samtools C API). */
int main(int argc, char** argv) {
    if (argc < 2) return EXIT_FAILURE;
    /* "rb" opens the input as binary BAM */
    samfile_t* sam = samopen(argv[1], "rb", 0);
    if (sam == NULL) return EXIT_FAILURE;
    bam1_t* b = bam_init1();
    long n = 0L;
    while (samread(sam, b) > 0)
    {
        /* count the read if its "unmapped" flag bit is not set */
        if (!(b->core.flag & BAM_FUNMAP)) ++n;
    }
    bam_destroy1(b);
    samclose(sam);
    printf("%ld\n", n);
    return 0;
}
88. Reading SAM with the java picard library
import java.io.File;
import net.sf.samtools.*;
/* Count the mapped reads of a SAM/BAM file. */
public class CountMapped {
    public static void main(String[] args) {
        long n = 0L;
        File f = new File(args[0]);
        SAMFileReader sam = new SAMFileReader(f);
        for (SAMRecord rec : sam)
        {
            // count the read if its "unmapped" flag is not set
            if (!rec.getReadUnmappedFlag())
            {
                ++n;
            }
        }
        sam.close();
        System.out.println(n);
    }
}
89. End
90. Credits
Angus: http://ged.msu.edu/angus/
Wikibooks: https://en.wikibooks.org/wiki/C%2B%2B_Programming/Programming_Languages/C%2B%2B/Code/Statements/Variables
Abecasis Group Wiki: http://genome.sph.umich.edu/wiki/SAM
Genome Research: http://genome.cshlp.org/content/12/6/996