Next Generation Sequencing 
File Formats. 
Pierre Lindenbaum 
@yokofakun 
pierre.lindenbaum@univ-nantes.fr 
http://plindenbaum.blogspot.com 
https://github.com/lindenb/courses 
Institut du Thorax. Nantes. France 
September 19, 2014 
You don't need to have a deep knowledge of those formats. 
(Unless you're doing NGS) 
Understand how people have solved their BIG data problems. 
Why sequencing ? 
Well, that's a little more complicated ... 
FASTQ 
FASTQ 
FASTQ: text-based format for storing both a DNA sequence and 
its corresponding quality scores 
FASTQ 
FASTQ for single end 
FASTQ 
FASTQ for paired end 
FASTQ Example 
@IL31_4368:1:1:996:8507/1 
NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA 
+ 
(94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.% 
@IL31_4368:1:1:996:21421/1 
NAAGTTAATTCTTCATTGTCCATTCCTCTGAAATGATTCAGAAATACTGGTAGT 
+ 
(**+*2396,@<+<:@@@;;5)<0)69606>4;5>;>6&<102)0*+8:&137; 
@IL31_4368:1:1:997:10572/1 
NAATGTATGTAGACCCTTCACATTCAAAGGCAAATACAATATCATCATGTCTTC 
+ 
(/9**-0032>:>>9>4@@=>??@@:-66,;>;<;6+;255,1;7>>>>3676' 
@IL31_4368:1:1:997:15684/1 
NGCAATCAATGCTATGATTGATCCTGATGGAACTTTGGAGGCTCTGAACAACAT 
+ 
()1,*37766>@@@>?@<?@@:>@0>>><-888>8;>*;966>;;;@8@4,.2. 
@IL31_4368:1:1:997:15249/1 
NCGTTATAATGGAATTATTTTTCTTCCTTTATTTAATGTGTTGACAAAGAGAAC 
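The four-line records above can be read with a few lines of code. The following is a minimal sketch, assuming every record spans exactly four lines (no wrapped sequences); the function name `read_fastq` is our own and does not come from any library.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ text stream.

    Assumes the common 4-line layout: @name, sequence, '+', quality.
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of stream
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in " + header.strip())
        # Drop the leading '@' from the read name.
        yield header[1:].strip(), sequence, quality


# First record of the example above.
example = io.StringIO(
    "@IL31_4368:1:1:996:8507/1\n"
    "NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA\n"
    "+\n"
    "(94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.%\n"
)
records = list(read_fastq(example))
```

Because the function is a generator, it works on a stream and never holds more than one record in memory.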
FASTQ name 
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG 
Col      Brief description 
EAS139   the unique instrument name 
136      the run id 
FC706VJ  the flowcell id 
2        flowcell lane 
2104     tile number within the flowcell lane 
15343    'x'-coordinate of the cluster within the tile 
197393   'y'-coordinate of the cluster within the tile 
1        the member of a pair, 1 or 2 (paired-end or mate-pair reads only) 
Y        Y if the read fails filter (read is bad), N otherwise 
18       0 when none of the control bits are on, otherwise it is an even number 
ATCACG   index sequence 
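The read name can be split into these fields mechanically. This is a small sketch, assuming the exact layout shown in the table (seven colon-separated fields, a space, then four more); the function name and dictionary keys are our own choices.

```python
def parse_read_name(header):
    """Split a FASTQ read name of the form shown above into named fields."""
    # The part before the space locates the cluster; the part after describes the read.
    location, _, description = header.lstrip("@").partition(" ")
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    member, is_filtered, control_bits, index = description.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "member": int(member),
        "fails_filter": is_filtered == "Y",
        "control_bits": int(control_bits),
        "index": index,
    }


fields = parse_read_name("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
```

Older read names, such as the `IL31_4368:1:1:996:8507/1` names used elsewhere in these slides, follow a different layout and would need a different parser.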
FASTQ Quality 
FASTQ Quality 
A quality value Q is an integer mapping of p (i.e., the probability 
that the corresponding base call is incorrect). 
$Q_{\text{sanger}} = -10 \log_{10} p$ 
FASTQ Quality 
Since a human readable format is desired for SAM, 33 is added to 
the calculated quality in order to make it a printable character 
ranging from '!' to '~'. 
ASCII code $= Q_{\text{sanger}} + 33 = -10 \log_{10} p + 33$ 
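The conversion between error probability, Phred score and printable character can be sketched as follows. The function names are ours; the offset of 33 is the one given above.

```python
import math

PHRED_OFFSET = 33  # offset added to make the score a printable character


def error_prob_to_phred(p):
    """Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def phred_to_char(q):
    """Encode a Phred score as a single printable ASCII character."""
    return chr(q + PHRED_OFFSET)


def char_to_error_prob(c):
    """Decode a quality character back to the error probability it stands for."""
    q = ord(c) - PHRED_OFFSET
    return 10 ** (-q / 10)


q30 = error_prob_to_phred(0.001)   # 1 error in 1000 base calls
symbol = phred_to_char(q30)
p_back = char_to_error_prob("I")   # 'I' is ASCII 73, i.e. Q40
```

A higher character code therefore means a more trustworthy base call: each step of 10 in Q divides the error probability by 10.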
Aligned Reads 
44187101 44187111 44187121 44187131 44187141 44187151 44187161 44187171 
aaatgagccaggtgtggtggtgcacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc REFERENCE 
............................................Y................................... CONSENSUS 
aaa gagccaggtgtggtggtgcacaccgataggcccagctacgtaggaggctgaggtgggaggatcgcttaaa cggc 
AAA GAGCCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAA CGGC 
aaatga CCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCC c 
aaatgagcc GGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC 
AAATGAGCCAGG gtggtggtgcacacctatagtcccagcgacgtaggaggctgaggtgggaggatcgcttaaacccggc 
AAATGAGCCAGGTG ggtggtgcacacctatagtcccagctaagtaggaggctgaggtgggaggatcgctttaacccggc 
AAATGAGCCAGGTGT GTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCGGGC 
ACATGAGCCAGGTGTG tggtgcacacctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc 
aaatgagccaggtgtgg GCACACGTAAAGTCCCAGCTACGCAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC 
CAATGAGCCAGTTGTGG cacacctatagtcccagctacgcacgaggctgaggtgggaggatcgctttaacccggc 
AAATGAGCCAGGTGAGGT cacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc 
AAATGAGCCAGGTGTGGT acacctatagtcccagctacgcaggaggctgaggtgggaggatcgctttaacccggc 
aaatgagccaggtgtggtgg cctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc 
AAATGAGCCAGGTGTGGTGG TATAGTCCCAGCTACGCAGGAGGCTGAGGTGGTAGGATCGCATAAACCCGGC 
AAATGAGCCAGGTGTGGTGGT TAGTCCCAGCTACGTAGGAGGCTGAGTTGGGAGGATCTCTTAAACCCGGC 
aaatgagccaggtgtggtggtg TCGTCCCAGCTACGCAGGAGGCTTAGGTGGGAGGATCGCTTAAACCCGGC 
aaatgagccaggtgtggtggtgca AGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGGTTAAACCCGGC 
aaatgagccaggtgtggtggtgcac cccagctacgcaggaggctgaggtgggaccatcgcttaaaccccgc 
aaatgagccaggtgtggtggtgcac CCAGCTACGTAGTAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC 
SAM 
SAM Format 
SAM (Sequence Alignment/Map) format is a generic format for 
storing large nucleotide sequence alignments 
SAM Format 
- Is flexible enough to store all the alignment information 
  generated by various alignment programs; 
- Is simple enough to be easily generated by alignment 
  programs or converted from existing alignment formats; 
- Is compact in file size; 
- Allows most operations on the alignment to work on a 
  stream without loading the whole alignment into memory; 
- Allows the file to be indexed by genomic position to efficiently 
  retrieve all reads aligning to a locus. 
SAM Format 
Structure 
+ HEADER 
-version 
-program parameters 
+GENOME 
- chrom1 size 
- chrom2 size 
- chrom3 size 
- (..) 
+GROUPS 
- group1 : sample1, lane 4 
- group2 : sample2, lane 1 
+ BODY 
- READ1 - group1 
- READ2 - group1 
- READ3 - group1 
- READ4 - group2 
- (...) 
SAM Example 
Simple example 
@HD VN:1.5 SO:coordinate 
@SQ SN:ref LN:45 
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * 
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * 
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0; 
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * 
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1; 
r001 83 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1 
SAM Header Section 
SAM Header 
@HD VN:1.0 SO:coordinate 
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e 
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5 
@RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 
@RG ID:UM0098:2 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L002 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 
@PG ID:bwa VN:0.5.4 
@PG ID:GATK TableRecalibration VN:1.0.3471 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, CycleCovariate, 
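Header lines are easy to turn into key/value pairs. The sketch below assumes the fields are tab-separated, as in a real SAM file (the tabs are shown as spaces on the slide); the function names are ours, and the sample header is shortened from the one above.

```python
def parse_header_line(line):
    """Return (record_type, {tag: value}) for one tab-delimited SAM header line."""
    columns = line.rstrip("\n").split("\t")
    record_type = columns[0].lstrip("@")
    tags = {}
    for column in columns[1:]:
        # Split on the first ':' only, because values may contain ':' themselves.
        key, _, value = column.partition(":")
        tags[key] = value
    return record_type, tags


def reference_lengths(header_lines):
    """Collect the length of every reference sequence declared by @SQ lines."""
    lengths = {}
    for line in header_lines:
        record_type, tags = parse_header_line(line)
        if record_type == "SQ":
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621\tAS:NCBI37",
    "@SQ\tSN:2\tLN:243199373\tAS:NCBI37",
    "@RG\tID:UM0098:1\tPL:ILLUMINA\tSM:SD37743",
]
lengths = reference_lengths(header)
```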
SAM Alignment Section 
SAM Example 
Simple example 
IL31_4368:1:1:996:8507 77 * 0 0 * * 0 0 NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA (94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.% 
IL31_4368:1:1:996:8507 141 * 0 0 * * 0 0 TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA FFCEFFFEEFFFFFFFEFFEFFFEFCFCEEFEFFFCEFF;EEFF=FEE?FCE 
IL31_4368:1:1:996:21421 77 * 0 0 * * 0 0 NAAGTTAATTCTTCATTGTCCATTCCTCTGAAATGATTCAGAAATACTGGTAGT (**+*2396,@<+<:@@@;;5)<0)69606>4;5>;>6&<102)0*+8:&137; 
IL31_4368:1:1:996:21421 141 * 0 0 * * 0 0 CAAAAACTTTCACTTTACCTGCCGGGTTTCCCAGTTTACATTCCACTGTTTGAC DBDDB,B9BAA4AAB7BB?7BBB=91;+*@;587+*=/*@@?9=73=.7)7* 
IL31_4368:1:1:997:10572 77 * 0 0 * * 0 0 NAATGTATGTAGACCCTTCACATTCAAAGGCAAATACAATATCATCATGTCTTC (/9**-0032>:>>9>4@@=>??@@:-66,;>;<;6+;255,1;7>>>>3676' 
IL31_4368:1:1:997:10572 141 * 0 0 * * 0 0 GATCTTCTGTGACTGGAAGAAAATGTGTTACATATTACATTTCTGTCCCCATTG E?=EECEEEEE98EEEEAEEBD??BE@AEABEEABCEEDECEBDA=DEE 
IL31_4368:1:1:997:15684 83 chr1 241356612 60 54M = 241356442 -224 ATGTTGTTCAGAGCCTCCAAAGTTCCATCAGGATCAATCATAGCATTGATTGCN 
IL31_4368:1:1:997:15684 163 chr1 241356442 60 54M = 241356612 224 CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC 
IL31_4368:1:1:997:15249 77 * 0 0 * * 0 0 NCGTTATAATGGAATTATTTTTCTTCCTTTATTTAATGTGTTGACAAAGAGAAC (916928.82@@054;33222224;@2?22;5=;;858*0666 
IL31_4368:1:1:997:15249 141 * 0 0 * * 0 0 AATGTTCTGAAACCTCTGAGAAAGCAAATATTTATTTTAATGAAAAATCCTTAT EDEEC;EEE;EEE?EECE;7AEEEEEE07EECEA;D6D+EE4E7EEE4;E=EA 
IL31_4368:1:1:997:6273 77 * 0 0 * * 0 0 NTACGAAGAAGTATTTCATTGGGAGGAGCTTATCCAAATATTTCCTGTCTATCC (**4*5-*3299+::@2;;853+39;0.3)-)79)..'5.988*200 
IL31_4368:1:1:997:6273 141 * 0 0 * * 0 0 ACATTTACCAAGACCAAAGGAAACTTACCTTGCAAGAATTAGACAGTTCATTTG EEAAFFFEEFEFCFAFFAFCCFFEFEFEFFFFB?ABA@ECEE=F@DE@DDF; 
IL31_4368:1:1:997:1657 83 chr1 143630364 60 54M = 143630066 -352 TACCTTTTTAAAGAGATCTAAAATTGTCACATGGTTATTAGATACAGAGGCCTN 
IL31_4368:1:1:997:1657 163 chr1 143630066 60 54M = 143630364 352 CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT 
IL31_4368:1:1:997:5609 77 * 0 0 * * 0 0 NGGTGTCTCTTACGGACAGCATTAAGCTAGATTCTTTTTAGACCGATCTGCCAA (*+*,1426;@@??@?9@@@@4?666260.)-*9;;;8:'0418 
IL31_4368:1:1:997:5609 141 * 0 0 * * 0 0 TCACTATCAGAAACAGAATGTATAACTTCCAAATCAGTAGGAAACACAAGGAAA AEECECBEC@A;AC=AEEEEAEEEEAC,CE?ECCE9EAEC4E:CAC@EE) 
IL31_4368:1:1:997:14262 77 * 0 0 * * 0 0 NGAGAACCAATGGGAAGCAGCCTGAGCTGCTGGAACCTATTCCCCATGACTTCA (9136242-2@@@;96.@@@@0$2623.':**+3*03137..--. 
IL31_4368:1:1:997:14262 141 * 0 0 * * 0 0 TGTTTTTTCTTTTTCTTTTTTTTTTGACAGTGCAGAGATTTTTTATCTTTTTAA 97'2.64.?7/3(891?=(6??6+6++/*..3(:'/'9::''(1.(, 
IL31_4368:1:1:998:19914 77 * 0 0 * * 0 0 NAGAGCATTGACACACATAAAAAATTAAAACAACCCTTTGTACTTACGGTAGAA (/89255@?7..())@@@;2265267@@8..3;/$ 
IL31_4368:1:1:998:19914 141 * 0 0 * * 0 0 GAATGAAAGCAGAGACCCTGATCGAGCCCCAGAAAGATACACCTCCAGATTTTA C?=CECE4CD?8@==;EBE=0@:@@92@???6991.?A=@5?@99;971 
SAM Example 
Sorted SAM 
One row is one read, NOT one fragment. 
IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21 58 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC 
IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17 -58 ACCCTACCCCTAGCCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA 
IL31_4368:1:10:17817:9758 137 chr1 23 0 54M = 23 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCAGATC 
IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44 61 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 
IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37 -61 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT 
SAM Specifications 
Record Column 
Col  Field  Type    Brief description 
1    QNAME  String  Query template NAME 
2    FLAG   Int     bitwise FLAG 
3    RNAME  String  Reference sequence NAME 
4    POS    Int     1-based leftmost mapping POSition 
5    MAPQ   Int     MAPping Quality 
6    CIGAR  String  CIGAR string 
7    RNEXT  String  Ref. name of the mate/next read 
8    PNEXT  Int     Position of the mate/next read 
9    TLEN   Int     observed Template LENgth 
10   SEQ    String  segment SEQuence 
11   QUAL   String  ASCII of Phred-scaled base QUALity+33 
12   META           metadata 
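The column table maps directly onto a small record type. This is a sketch only, assuming tab-separated input as in a real SAM file; `SamRecord` and `parse_sam_line` are names made up for this example, and the optional metadata columns are kept as raw strings.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one alignment line into a SamRecord with typed columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("expected at least 11 columns, got %d" % len(columns))
    return SamRecord(
        qname=columns[0],
        flag=int(columns[1]),
        rname=columns[2],
        pos=int(columns[3]),      # 1-based leftmost position
        mapq=int(columns[4]),
        cigar=columns[5],
        rnext=columns[6],
        pnext=int(columns[7]),
        tlen=int(columns[8]),
        seq=columns[9],
        qual=columns[10],
        tags=columns[11:],        # optional fields, left unparsed here
    )


# Last record of the simple SAM example shown earlier.
line = "r001\t83\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1"
record = parse_sam_line(line)
```

Since each line is parsed on its own, the function can be applied to a stream without loading the whole alignment into memory.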
SAM Specifications 
Record Column 
Col  Field  Example value 
1    QNAME  IL31_4368:1:42:12530:7509 
2    FLAG   137 
3    RNAME  chr1 
4    POS    10 
5    MAPQ   30 
6    CIGAR  54M 
7    RNEXT  = 
8    PNEXT  100 
9    TLEN   90 
10   SEQ    TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGATC 
11   QUAL   GGGGGGGFEGGGGCFGGGGGEGGFGEGGFGFGGFGFEGFCFFBECCBDACB@?B 
12   META   XT:A:R NM:i:3 SM:i:0 AM:i:0 X0:i:11 X1:i:0 XM:i:3 XO:i:0 XG:i:0 
SAM FLAGS 
- read paired. 
- read mapped in proper pair. 
- read unmapped. 
- mate unmapped. 
- read reverse strand. 
- mate reverse strand. 
- first in pair. 
- second in pair. 
- not primary alignment. 
- read fails platform/vendor quality checks. 
- read is PCR or optical duplicate. 
- supplementary alignment. 
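Each flag in the list is one bit of the FLAG integer, starting at 0x1 for "read paired" and doubling down the list, so a FLAG value can be decoded with a bitwise AND. A minimal sketch (the names `FLAG_BITS` and `decode_flag` are ours):

```python
# Bit values in the same order as the list of SAM flags.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


# 137 = 1 + 8 + 128, the FLAG of the example record.
names_137 = decode_flag(137)
# 83 = 1 + 2 + 16 + 64, a typical properly paired reverse-strand first read.
names_83 = decode_flag(83)
```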
SAM FLAGS 
[Figures: one illustration per flag, from "read paired" to "supplementary alignment".] 
SAM CIGAR 
The CIGAR string is a sequence of base lengths, each followed by 
its associated operation. It indicates which bases align with the 
reference (as a match or a mismatch), which are deleted from the 
reference, and which are insertions that are not in the 
reference. 
SAM Cigar 
Op BAM Description 
M 0 alignment match (can be a sequence match or mismatch) 
I 1 insertion to the reference 
D 2 deletion from the reference 
N 3 skipped region from the reference 
S 4 soft clipping (clipped sequences present in SEQ) 
H 5 hard clipping (clipped sequences NOT present in SEQ) 
P 6 padding (silent deletion from padded reference) 
= 7 sequence match 
X 8 sequence mismatch 
SAM Cigar 
http://genome.sph.umich.edu/wiki/SAM 
RefPos:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 
Reference:   C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C 
Read: ACTAGAATGGCT 
Aligning these two: 
RefPos:      1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19 
Reference:   C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C 
Read:                    A  C  T  A  G  A  A     T  G  G  C  T 
With the alignment above, you get: 
POS: 5 
CIGAR: 3M1I3M1D5M 
or 
CIGAR: 3=1I3=1D2=1X2= 
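A CIGAR string can be checked against this example by counting how many read bases and how many reference bases it consumes. The sketch below uses our own helper names; which operations consume the read or the reference follows the operation table (M, I, S, = and X consume read bases; M, D, N, = and X consume reference bases).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern silently skipped.
    if "".join("%d%s" % (length, op) for length, op in ops) != cigar:
        raise ValueError("invalid CIGAR string: " + cigar)
    return ops


def cigar_lengths(cigar):
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query = sum(length for length, op in ops if op in CONSUMES_QUERY)
    reference = sum(length for length, op in ops if op in CONSUMES_REFERENCE)
    return query, reference


query_len, ref_len = cigar_lengths("3M1I3M1D5M")
end_pos = 5 + ref_len - 1  # POS is 5 in the example, positions are 1-based
```

The read has 12 bases and the alignment covers reference positions 5 to 16, which matches the picture above.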
SAM Cigar 
Soft Clip 
SAM Cigar 
Hard Clip 
SAM Format 
Optional TAGs 
Optional fields on a SAM/BAM alignment. A TAG is composed of 
a two-character TAG key, the type of the value, and the value: 
[A-Za-z][A-Za-z0-9]:[AifZH]:.* 
The types A, i, f, Z and H indicate the type of value 
stored in the tag. 
Type  Description 
A     character 
i     signed 32-bit integer 
f     single-precision float 
Z     string 
H     hex string 
SAM Format 
Optional TAGs 
XT:A:U - user-defined tag called XT. It holds a character. 
The value associated with this tag is 'U'. 
NM:i:2 - predefined tag NM means: edit distance to the 
reference (number of changes necessary to make this equal 
the reference, excluding clipping) 
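Splitting a tag into key, type and value takes one line; converting the value depends on the type letter. A small sketch, with `parse_tag` being a name of our own and hex strings left as text:

```python
def parse_tag(field):
    """Split 'KEY:TYPE:VALUE' and convert numeric values."""
    # maxsplit=2 keeps any ':' inside the value intact.
    tag, value_type, value = field.split(":", 2)
    if value_type == "i":
        value = int(value)
    elif value_type == "f":
        value = float(value)
    # 'A', 'Z' and 'H' values are kept as strings in this sketch.
    return tag, value_type, value


user_tag = parse_tag("XT:A:U")
edit_distance = parse_tag("NM:i:2")
```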
SAM Example 
Sorted SAM 
IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21 58 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC 
IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17 -58 ACCCTACCCCTAGCCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA 
IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44 61 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 
IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37 -61 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT 
BAM 
BGZF Format 
The SAM/BAM file format (Sequence Alignment/Map) comes in a 
plain text format (SAM), and a compressed binary format (BAM). 
The latter uses a modified form of gzip compression called BGZF 
(Blocked GNU Zip Format), which can be applied to any file 
format to provide compression with efficient random access. 
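The idea behind blocked compression can be illustrated with the standard `gzip` module. This is a toy sketch of the principle, not the BGZF on-disk layout: it writes independent gzip members and keeps their offsets in a side list, whereas real BGZF records each block size inside the gzip header itself. The function names and the block size are our own.

```python
import gzip


def write_blocks(data, block_size):
    """Compress data as a series of independent gzip members.

    Returns the compressed bytes and the offset of each member.
    """
    out = bytearray()
    offsets = []
    for start in range(0, len(data), block_size):
        offsets.append(len(out))
        out += gzip.compress(data[start:start + block_size])
    return bytes(out), offsets


def read_block(blob, offsets, i):
    """Decompress only block i, without touching the other blocks."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(blob)
    return gzip.decompress(blob[offsets[i]:end])


data = b"ACGT" * 1000
blob, offsets = write_blocks(data, block_size=512)

# The whole file is still a valid gzip stream ...
whole = gzip.decompress(blob)
# ... but a single block can be fetched on its own.
fourth_block = read_block(blob, offsets, 3)
```

Random access then costs one seek plus the decompression of one small block, instead of decompressing everything that precedes the region of interest.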
BAM INDEX 
CRAM 
VCF 
Variant Call Format 
VCF Format 
VCF is a text file format (most likely stored in a compressed 
manner). It contains meta-information lines, a header line, and 
then data lines each containing information about a position in the 
genome. 
VCF 
Example 
##fileformat=VCFv4.0 
##fileDate=20090805 
##source=myImputationProgramV3.1 
##reference=1000GenomesPilotNCBI36 
##phasing=partial 
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> 
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> 
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency"> 
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> 
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> 
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> 
##FILTER=<ID=q10,Description="Quality below 10"> 
##FILTER=<ID=s50,Description="Less than 50% of samples have data"> 
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> 
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> 
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> 
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> 
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 2/2:35:4 
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3 
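A data line from this example can be parsed as follows. This is a sketch only, assuming tab-separated columns as in a real VCF file; the function names and the dictionary layout are ours, and all values are kept as strings except POS.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    result = {}
    if info == ".":
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Entries without '=' are flags, stored here as True.
        result[key] = value if sep else True
    return result


def parse_vcf_line(line, sample_names):
    """Parse one VCF data line; sample_names come from the #CHROM header line."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = columns[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": qual,
        "filter": filt,
        "info": parse_info(info),
        "samples": {},
    }
    if len(columns) > 8:
        keys = columns[8].split(":")  # the FORMAT column names the per-sample fields
        for name, column in zip(sample_names, columns[9:]):
            record["samples"][name] = dict(zip(keys, column.split(":")))
    return record


line = (
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\t"
    "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,."
)
record = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
```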
VCF 
Column 
CHROM 
POS 
ID 
REF 
ALT 
QUAL 
FILTER 
INFO 
FORMAT 
SAMPLE-1 
SAMPLE-2 
SAMPLE-3 
... 
(...) 
A CTF Hackers ToolboxStefan
 

Similaire à File formats for Next Generation Sequencing (20)

Introduction to Linux
Introduction to LinuxIntroduction to Linux
Introduction to Linux
 
Python and Machine Learning
Python and Machine LearningPython and Machine Learning
Python and Machine Learning
 
Message-passing concurrency in Python
Message-passing concurrency in PythonMessage-passing concurrency in Python
Message-passing concurrency in Python
 
Qt Translations
Qt TranslationsQt Translations
Qt Translations
 
Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)Version Control in Machine Learning + AI (Stanford)
Version Control in Machine Learning + AI (Stanford)
 
biopython, doctest and makefiles
biopython, doctest and makefilesbiopython, doctest and makefiles
biopython, doctest and makefiles
 
Pythonlearn-01-Intro.pptx
Pythonlearn-01-Intro.pptxPythonlearn-01-Intro.pptx
Pythonlearn-01-Intro.pptx
 
Pyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdfPyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdf
 
(Slightly) Smarter Smart Pointers
(Slightly) Smarter Smart Pointers(Slightly) Smarter Smart Pointers
(Slightly) Smarter Smart Pointers
 
Pypy is-it-ready-for-production-the-sequel
Pypy is-it-ready-for-production-the-sequelPypy is-it-ready-for-production-the-sequel
Pypy is-it-ready-for-production-the-sequel
 
Operating System Practice : Meeting 2-basic commands linux operating system-s...
Operating System Practice : Meeting 2-basic commands linux operating system-s...Operating System Practice : Meeting 2-basic commands linux operating system-s...
Operating System Practice : Meeting 2-basic commands linux operating system-s...
 
Concurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System DiscussionConcurrent Programming OpenMP @ Distributed System Discussion
Concurrent Programming OpenMP @ Distributed System Discussion
 
Tracing python applications
Tracing python applicationsTracing python applications
Tracing python applications
 
Using TypeScript at Dashlane
Using TypeScript at DashlaneUsing TypeScript at Dashlane
Using TypeScript at Dashlane
 
Brogramming - Python, Bash for Data Processing, and Git
Brogramming - Python, Bash for Data Processing, and GitBrogramming - Python, Bash for Data Processing, and Git
Brogramming - Python, Bash for Data Processing, and Git
 
Parallel computing in Python: Current state and recent advances
Parallel computing in Python: Current state and recent advancesParallel computing in Python: Current state and recent advances
Parallel computing in Python: Current state and recent advances
 
Why Python is better for Data Science
Why Python is better for Data ScienceWhy Python is better for Data Science
Why Python is better for Data Science
 
The Next Linux Superpower: eBPF Primer
The Next Linux Superpower: eBPF PrimerThe Next Linux Superpower: eBPF Primer
The Next Linux Superpower: eBPF Primer
 
MNE group analysis presentation @ Biomag 2016 conf.
MNE group analysis presentation @ Biomag 2016 conf.MNE group analysis presentation @ Biomag 2016 conf.
MNE group analysis presentation @ Biomag 2016 conf.
 
A CTF Hackers Toolbox
A CTF Hackers ToolboxA CTF Hackers Toolbox
A CTF Hackers Toolbox
 

Plus de Pierre Lindenbaum

Plus de Pierre Lindenbaum (20)

Mum, I 3D printed a gel comb !
Mum, I 3D printed a gel comb !Mum, I 3D printed a gel comb !
Mum, I 3D printed a gel comb !
 
"Mon make à moi", (tout sauf Galaxy)
"Mon make à moi", (tout sauf Galaxy)"Mon make à moi", (tout sauf Galaxy)
"Mon make à moi", (tout sauf Galaxy)
 
Advanced NCBI
Advanced NCBI Advanced NCBI
Advanced NCBI
 
Building a Simple LIMS with the Eclipse Modeling Framework (EMF) ,my notebook
Building a Simple LIMS with the Eclipse Modeling Framework (EMF) ,my notebookBuilding a Simple LIMS with the Eclipse Modeling Framework (EMF) ,my notebook
Building a Simple LIMS with the Eclipse Modeling Framework (EMF) ,my notebook
 
Make
MakeMake
Make
 
XML for bioinformatics
XML for bioinformaticsXML for bioinformatics
XML for bioinformatics
 
20120423.NGS.Rennes
20120423.NGS.Rennes20120423.NGS.Rennes
20120423.NGS.Rennes
 
Sketching 20120412
Sketching 20120412Sketching 20120412
Sketching 20120412
 
Introduction to mongodb for bioinformatics
Introduction to mongodb for bioinformaticsIntroduction to mongodb for bioinformatics
Introduction to mongodb for bioinformatics
 
Biostar17037
Biostar17037Biostar17037
Biostar17037
 
Tweeting for the BioStar Paper
Tweeting for the BioStar PaperTweeting for the BioStar Paper
Tweeting for the BioStar Paper
 
Variation Toolkit
Variation ToolkitVariation Toolkit
Variation Toolkit
 
Bioinformatician 2.0
Bioinformatician 2.0Bioinformatician 2.0
Bioinformatician 2.0
 
Analyzing Exome Data with KNIME
Analyzing Exome Data with KNIMEAnalyzing Exome Data with KNIME
Analyzing Exome Data with KNIME
 
NOTCH2 backstage
NOTCH2 backstageNOTCH2 backstage
NOTCH2 backstage
 
Bioinfo tweets
Bioinfo tweetsBioinfo tweets
Bioinfo tweets
 
Post doctoriales 2011
Post doctoriales 2011Post doctoriales 2011
Post doctoriales 2011
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course
 
MyWordle.java
MyWordle.javaMyWordle.java
MyWordle.java
 
Biblio2.0
Biblio2.0Biblio2.0
Biblio2.0
 

Dernier

SYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptx
SYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptxSYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptx
SYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptxdrashraf369
 
See the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy PlatformSee the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy PlatformKweku Zurek
 
Informed Consent Empowering Healthcare Decision-Making.pptx
Informed Consent Empowering Healthcare Decision-Making.pptxInformed Consent Empowering Healthcare Decision-Making.pptx
Informed Consent Empowering Healthcare Decision-Making.pptxSasikiranMarri
 
POST NATAL EXERCISES AND ITS IMPACT.pptx
POST NATAL EXERCISES AND ITS IMPACT.pptxPOST NATAL EXERCISES AND ITS IMPACT.pptx
POST NATAL EXERCISES AND ITS IMPACT.pptxvirengeeta
 
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic AnalysisVarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic AnalysisGolden Helix
 
COVID-19 (NOVEL CORONA VIRUS DISEASE PANDEMIC ).pptx
COVID-19  (NOVEL CORONA  VIRUS DISEASE PANDEMIC ).pptxCOVID-19  (NOVEL CORONA  VIRUS DISEASE PANDEMIC ).pptx
COVID-19 (NOVEL CORONA VIRUS DISEASE PANDEMIC ).pptxBibekananda shah
 
Primary headache and facial pain. (2024)
Primary headache and facial pain. (2024)Primary headache and facial pain. (2024)
Primary headache and facial pain. (2024)Mohamed Rizk Khodair
 
world health day presentation ppt download
world health day presentation ppt downloadworld health day presentation ppt download
world health day presentation ppt downloadAnkitKumar311566
 
Let's Talk About It: To Disclose or Not to Disclose?
Let's Talk About It: To Disclose or Not to Disclose?Let's Talk About It: To Disclose or Not to Disclose?
Let's Talk About It: To Disclose or Not to Disclose?bkling
 
Statistical modeling in pharmaceutical research and development.
Statistical modeling in pharmaceutical research and development.Statistical modeling in pharmaceutical research and development.
Statistical modeling in pharmaceutical research and development.ANJALI
 
Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...
Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...
Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...Wessex Health Partners
 
Music Therapy's Impact in Palliative Care| IAPCON2024| Dr. Tara Rajendran
Music Therapy's Impact in Palliative Care| IAPCON2024| Dr. Tara RajendranMusic Therapy's Impact in Palliative Care| IAPCON2024| Dr. Tara Rajendran
Music Therapy's Impact in Palliative Care| IAPCON2024| Dr. Tara RajendranTara Rajendran
 
97111 47426 Call Girls In Delhi MUNIRKAA
97111 47426 Call Girls In Delhi MUNIRKAA97111 47426 Call Girls In Delhi MUNIRKAA
97111 47426 Call Girls In Delhi MUNIRKAAjennyeacort
 
Glomerular Filtration and determinants of glomerular filtration .pptx
Glomerular Filtration and  determinants of glomerular filtration .pptxGlomerular Filtration and  determinants of glomerular filtration .pptx
Glomerular Filtration and determinants of glomerular filtration .pptxDr.Nusrat Tariq
 
Pharmaceutical Marketting: Unit-5, Pricing
Pharmaceutical Marketting: Unit-5, PricingPharmaceutical Marketting: Unit-5, Pricing
Pharmaceutical Marketting: Unit-5, PricingArunagarwal328757
 
PNEUMOTHORAX AND ITS MANAGEMENTS.pdf
PNEUMOTHORAX   AND  ITS  MANAGEMENTS.pdfPNEUMOTHORAX   AND  ITS  MANAGEMENTS.pdf
PNEUMOTHORAX AND ITS MANAGEMENTS.pdfDolisha Warbi
 
Culture and Health Disorders Social change.pptx
Culture and Health Disorders Social change.pptxCulture and Health Disorders Social change.pptx
Culture and Health Disorders Social change.pptxDr. Dheeraj Kumar
 
Basic principles involved in the traditional systems of medicine PDF.pdf
Basic principles involved in the traditional systems of medicine PDF.pdfBasic principles involved in the traditional systems of medicine PDF.pdf
Basic principles involved in the traditional systems of medicine PDF.pdfDivya Kanojiya
 
call girls in aerocity DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
call girls in aerocity DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️call girls in aerocity DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
call girls in aerocity DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️saminamagar
 
PULMONARY EDEMA AND ITS MANAGEMENT.pdf
PULMONARY EDEMA AND  ITS  MANAGEMENT.pdfPULMONARY EDEMA AND  ITS  MANAGEMENT.pdf
PULMONARY EDEMA AND ITS MANAGEMENT.pdfDolisha Warbi
 

Dernier (20)

SYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptx
SYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptxSYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptx
SYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptx
 
See the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy PlatformSee the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy Platform
 
Informed Consent Empowering Healthcare Decision-Making.pptx
Informed Consent Empowering Healthcare Decision-Making.pptxInformed Consent Empowering Healthcare Decision-Making.pptx
Informed Consent Empowering Healthcare Decision-Making.pptx
 
POST NATAL EXERCISES AND ITS IMPACT.pptx
POST NATAL EXERCISES AND ITS IMPACT.pptxPOST NATAL EXERCISES AND ITS IMPACT.pptx
POST NATAL EXERCISES AND ITS IMPACT.pptx
 
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic AnalysisVarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
 
COVID-19 (NOVEL CORONA VIRUS DISEASE PANDEMIC ).pptx
COVID-19  (NOVEL CORONA  VIRUS DISEASE PANDEMIC ).pptxCOVID-19  (NOVEL CORONA  VIRUS DISEASE PANDEMIC ).pptx
COVID-19 (NOVEL CORONA VIRUS DISEASE PANDEMIC ).pptx
 
Primary headache and facial pain. (2024)
Primary headache and facial pain. (2024)Primary headache and facial pain. (2024)
Primary headache and facial pain. (2024)
 
world health day presentation ppt download
world health day presentation ppt downloadworld health day presentation ppt download
world health day presentation ppt download
 
Let's Talk About It: To Disclose or Not to Disclose?
Let's Talk About It: To Disclose or Not to Disclose?Let's Talk About It: To Disclose or Not to Disclose?
Let's Talk About It: To Disclose or Not to Disclose?
 
Statistical modeling in pharmaceutical research and development.
Statistical modeling in pharmaceutical research and development.Statistical modeling in pharmaceutical research and development.
Statistical modeling in pharmaceutical research and development.
 
Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...
Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...
Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...
 
Music Therapy's Impact in Palliative Care| IAPCON2024| Dr. Tara Rajendran
Music Therapy's Impact in Palliative Care| IAPCON2024| Dr. Tara RajendranMusic Therapy's Impact in Palliative Care| IAPCON2024| Dr. Tara Rajendran
Music Therapy's Impact in Palliative Care| IAPCON2024| Dr. Tara Rajendran
 
97111 47426 Call Girls In Delhi MUNIRKAA
97111 47426 Call Girls In Delhi MUNIRKAA97111 47426 Call Girls In Delhi MUNIRKAA
97111 47426 Call Girls In Delhi MUNIRKAA
 
Glomerular Filtration and determinants of glomerular filtration .pptx
Glomerular Filtration and  determinants of glomerular filtration .pptxGlomerular Filtration and  determinants of glomerular filtration .pptx
Glomerular Filtration and determinants of glomerular filtration .pptx
 
Pharmaceutical Marketting: Unit-5, Pricing
Pharmaceutical Marketting: Unit-5, PricingPharmaceutical Marketting: Unit-5, Pricing
Pharmaceutical Marketting: Unit-5, Pricing
 
PNEUMOTHORAX AND ITS MANAGEMENTS.pdf
PNEUMOTHORAX   AND  ITS  MANAGEMENTS.pdfPNEUMOTHORAX   AND  ITS  MANAGEMENTS.pdf
PNEUMOTHORAX AND ITS MANAGEMENTS.pdf
 
Culture and Health Disorders Social change.pptx
Culture and Health Disorders Social change.pptxCulture and Health Disorders Social change.pptx
Culture and Health Disorders Social change.pptx
 
Basic principles involved in the traditional systems of medicine PDF.pdf
Basic principles involved in the traditional systems of medicine PDF.pdfBasic principles involved in the traditional systems of medicine PDF.pdf
Basic principles involved in the traditional systems of medicine PDF.pdf
 
call girls in aerocity DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
call girls in aerocity DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️call girls in aerocity DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
call girls in aerocity DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
 
PULMONARY EDEMA AND ITS MANAGEMENT.pdf
PULMONARY EDEMA AND  ITS  MANAGEMENT.pdfPULMONARY EDEMA AND  ITS  MANAGEMENT.pdf
PULMONARY EDEMA AND ITS MANAGEMENT.pdf
 

File formats for Next Generation Sequencing

  • 1. Next Generation Sequencing File Formats. Pierre Lindenbaum @yokofakun pierre.lindenbaum@univ-nantes.fr http://plindenbaum.blogspot.com https://github.com/lindenb/courses Institut du Thorax. Nantes. France September 19, 2014
  • 2. You don't need to have a deep knowledge of those formats. (Unless you're doing NGS)
  • 3. Understand how people have solved their BIG data problems.
  • 4. Why sequencing?
  • 7. Well, that's a little more complicated ...
  • 8. FASTQ
  • 9. FASTQ: text-based format for storing both a DNA sequence and its corresponding quality scores
  • 10. FASTQ for single end
  • 11. FASTQ for paired end
  • 12. FASTQ Example
      @IL31_4368:1:1:996:8507/1
      NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA
      +
      (94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.%
      @IL31_4368:1:1:996:21421/1
      NAAGTTAATTCTTCATTGTCCATTCCTCTGAAATGATTCAGAAATACTGGTAGT
      +
      (**+*2396,@<+<:@@@;;5)<0)69606>4;5>;>6&<102)0*+8:&137;
      @IL31_4368:1:1:997:10572/1
      NAATGTATGTAGACCCTTCACATTCAAAGGCAAATACAATATCATCATGTCTTC
      +
      (/9**-0032>:>>9>4@@=>??@@:-66,;>;<;6+;255,1;7>>>>3676'
      @IL31_4368:1:1:997:15684/1
      NGCAATCAATGCTATGATTGATCCTGATGGAACTTTGGAGGCTCTGAACAACAT
      +
      ()1,*37766>@@@>?@<?@@:>@0>>><-888>8;>*;966>;;;@8@4,.2.
      @IL31_4368:1:1:997:15249/1
      NCGTTATAATGGAATTATTTTTCTTCCTTTATTTAATGTGTTGACAAAGAGAAC
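The four-line layout of each record above (a name line starting with `@`, the sequence, a `+` separator, then one quality character per base) can be read with a few lines of Python. This is only a minimal sketch, assuming every record occupies exactly four unwrapped lines; `FastqRecord` and `read_fastq` are hypothetical names, and the demo record is made up rather than copied from the slide.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # text after the leading '@'
    sequence: str   # base calls
    quality: str    # one encoded quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream (assumes 4 lines per record)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up two-record example (not from the slide).
demo = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nNTGA\n+\n(94*\n")
records = list(read_fastq(demo))
print(records[0].name, records[0].sequence, records[0].quality)
```

Because the function is a generator, it works on a stream and never holds the whole file in memory.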
  • 13. FASTQ name: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
      Col      Brief description
      EAS139   the unique instrument name
      136      the run id
      FC706VJ  the flowcell id
      2        flowcell lane
      2104     tile number within the flowcell lane
      15343    'x'-coordinate of the cluster within the tile
      197393   'y'-coordinate of the cluster within the tile
      1        the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
      Y        Y if the read fails filter (read is bad), N otherwise
      18       0 when none of the control bits are on, otherwise it is an even number
      ATCACG   index sequence
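The name fields listed above are separated by colons, with a single space between the cluster location and the read description. A minimal sketch of splitting such a name in Python follows; `IlluminaReadName` and `parse_read_name` are hypothetical names, and the code assumes the name follows exactly the layout shown on the slide.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int      # 1 or 2
    fails_filter: bool    # True when the flag is 'Y' (read is bad)
    control_bits: int
    index_sequence: str


def parse_read_name(name_line: str) -> IlluminaReadName:
    """Split a name such as '@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG'."""
    if not name_line.startswith("@"):
        raise ValueError("a FASTQ name line must start with '@'")
    # Exactly two whitespace-separated parts are expected.
    location, description = name_line[1:].split()
    instrument, run_id, flowcell_id, lane, tile, x, y = location.split(":")
    member, filtered, control, index = description.split(":")
    return IlluminaReadName(
        instrument, int(run_id), flowcell_id, int(lane), int(tile),
        int(x), int(y), int(member), filtered == "Y", int(control), index,
    )


parsed = parse_read_name("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(parsed.flowcell_id, parsed.lane, parsed.fails_filter)
```

Older names such as `IL31_4368:1:1:996:8507/1` use a different layout and would need their own parser.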
  • 15. FASTQ Quality
  • 16. FASTQ Quality: A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect). $Q_{\text{sanger}} = -10 \log_{10} p$
  • 17. FASTQ Quality: Since a human readable format is desired for SAM, 33 is added to the calculated quality in order to make it a printable character ranging from '!' to '~'. $Q_{\text{sanger}} = -10 \log_{10} p + 33$
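The mapping between an error probability, the Phred quality Q = -10 log10(p), and the printable character with code Q + 33 can be sketched in Python as below. The function names are hypothetical, and the 0 to 93 range check simply reflects the printable characters '!' (33) to '~' (126).

```python
import math

PHRED_OFFSET = 33  # Sanger / SAM convention: character code = Q + 33


def error_to_quality(p_error: float) -> int:
    """Q = -10 * log10(p), rounded to the nearest integer."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return round(-10 * math.log10(p_error))


def quality_to_char(quality: int) -> str:
    """Encode a quality as a printable character between '!' and '~'."""
    if not 0 <= quality <= 93:
        raise ValueError("quality must be between 0 and 93")
    return chr(quality + PHRED_OFFSET)


def char_to_quality(char: str) -> int:
    """Decode one quality character back to its integer quality."""
    return ord(char) - PHRED_OFFSET


def quality_to_error(quality: int) -> float:
    """Invert the Phred formula: p = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


q = error_to_quality(0.001)
print(q, quality_to_char(q), char_to_quality("("))
```

For instance, the first quality character of the first FASTQ record shown earlier is `(`, which decodes to Q = 7, an error probability of about 0.2; that fits a base called as `N`.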
  • 18. Aligned Reads
      44187101  44187111  44187121  44187131  44187141  44187151  44187161  44187171
      aaatgagccaggtgtggtggtgcacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc REFERENCE
      ............................................Y................................... CONSENSUS
      aaa gagccaggtgtggtggtgcacaccgataggcccagctacgtaggaggctgaggtgggaggatcgcttaaa  cggc
      AAA GAGCCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAA  CGGC
      aaatga CCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCC  c
      aaatgagcc GGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC
      AAATGAGCCAGG gtggtggtgcacacctatagtcccagcgacgtaggaggctgaggtgggaggatcgcttaaacccggc
      AAATGAGCCAGGTG ggtggtgcacacctatagtcccagctaagtaggaggctgaggtgggaggatcgctttaacccggc
      AAATGAGCCAGGTGT GTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCGGGC
      ACATGAGCCAGGTGTG tggtgcacacctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc
      aaatgagccaggtgtgg    GCACACGTAAAGTCCCAGCTACGCAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC
      CAATGAGCCAGTTGTGG     cacacctatagtcccagctacgcacgaggctgaggtgggaggatcgctttaacccggc
      AAATGAGCCAGGTGAGGT    cacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc
      AAATGAGCCAGGTGTGGT     acacctatagtcccagctacgcaggaggctgaggtgggaggatcgctttaacccggc
      aaatgagccaggtgtggtgg      cctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc
      AAATGAGCCAGGTGTGGTGG        TATAGTCCCAGCTACGCAGGAGGCTGAGGTGGTAGGATCGCATAAACCCGGC
      AAATGAGCCAGGTGTGGTGGT         TAGTCCCAGCTACGTAGGAGGCTGAGTTGGGAGGATCTCTTAAACCCGGC
      aaatgagccaggtgtggtggtg        TCGTCCCAGCTACGCAGGAGGCTTAGGTGGGAGGATCGCTTAAACCCGGC
      aaatgagccaggtgtggtggtgca       AGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGGTTAAACCCGGC
      aaatgagccaggtgtggtggtgcac         cccagctacgcaggaggctgaggtgggaccatcgcttaaaccccgc
      aaatgagccaggtgtggtggtgcac          CCAGCTACGTAGTAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC
  • 19. SAM
  • 20. SAM Format: SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
  • 21. SAM Format
      - Is flexible enough to store all the alignment information generated by various alignment programs;
      - Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
      - Is compact in file size;
      - Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
      - Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
  • 24. SAM Format Structure
      + HEADER
        - version
        - program parameters
        + GENOME
          - chrom1 size
          - chrom2 size
          - chrom3 size
          - (..)
        + GROUPS
          - group1 : sample1, lane 4
          - group2 : sample2, lane 1
      + BODY
        - READ1 - group1
        - READ2 - group1
        - READ3 - group1
        - READ4 - group2
        - (...)
  • 25. SAM Example: simple example
      @HD VN:1.5 SO:coordinate
      @SQ SN:ref LN:45
      r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
      r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
      r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
      r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
      r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
      r001 83 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
  • 26. SAM Header Section
  • 27. SAM Header
      @HD VN:1.0 SO:coordinate
      @SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
      @SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
      @SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
      @RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743
      @RG ID:UM0098:2 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L002 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743
      @PG ID:bwa VN:0.5.4
      @PG ID:GATK TableRecalibration VN:1.0.3471 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, CycleCovariate,
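Each header line above starts with a two-letter record type (`@HD`, `@SQ`, `@RG`, `@PG`) followed by `TAG:value` fields. In a real SAM file the fields are tab-separated. A minimal Python sketch of turning one header line into a type and a tag dictionary follows; `parse_header_line` is a hypothetical name, and the sketch keeps every value as a string.

```python
from typing import Dict, Tuple


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split one tab-separated SAM header line into (record type, tags)."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0]
    if not record_type.startswith("@"):
        raise ValueError("SAM header lines start with '@'")
    if record_type == "@CO":
        # Comment lines carry free text rather than TAG:value pairs.
        return "CO", {"text": "\t".join(fields[1:])}
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only: values such as 'UM0098:1'
        # or 'file:/data/...' contain colons themselves.
        key, _, value = field.partition(":")
        tags[key] = value
    return record_type[1:], tags


kind, tags = parse_header_line("@SQ\tSN:1\tLN:249250621\tAS:NCBI37")
print(kind, tags["SN"], int(tags["LN"]))
```

Collecting the `SN` and `LN` tags of all `@SQ` lines gives the chromosome sizes from the GENOME part of the structure, and the `@RG` lines give the read groups.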
  • 28. SAM Alignment Section
  • 29. SAM Example: simple example
      IL31_4368:1:1:996:8507 77 * 0 0 * * 0 0 NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA (94**0-)*7=0622@6;;;5;6:;63:4?-622647..-.5.%
      IL31_4368:1:1:996:8507 141 * 0 0 * * 0 0 TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA FFCEFFFEEFFFFFFFEFFEFFFEFCFCEEFEFFFCEFF;EEFF=FEE?FCE
      IL31_4368:1:1:996:21421 77 * 0 0 * * 0 0 NAAGTTAATTCTTCATTGTCCATTCCTCTGAAATGATTCAGAAATACTGGTAGT (**+*2396,@+:@@@;;5)0)696064;5;6102)0*+8:137;
      IL31_4368:1:1:996:21421 141 * 0 0 * * 0 0 CAAAAACTTTCACTTTACCTGCCGGGTTTCCCAGTTTACATTCCACTGTTTGAC DBDDB,B9BAA4AAB7BB?7BBB=91;+*@;587+*=/*@@?9=73=.7)7*
      IL31_4368:1:1:997:10572 77 * 0 0 * * 0 0 NAATGTATGTAGACCCTTCACATTCAAAGGCAAATACAATATCATCATGTCTTC (/9**-0032:94@@=??@@:-66,;;;6+;255,1;73676'
      IL31_4368:1:1:997:10572 141 * 0 0 * * 0 0 GATCTTCTGTGACTGGAAGAAAATGTGTTACATATTACATTTCTGTCCCCATTG E?=EECEEEEE98EEEEAEEBD??BE@AEABEEABCEEDECEBDA=DEE
      IL31_4368:1:1:997:15684 83 chr1 241356612 60 54M = 241356442 -224 ATGTTGTTCAGAGCCTCCAAAGTTCCATCAGGATCAATCATAGCATTGATTGCN
      IL31_4368:1:1:997:15684 163 chr1 241356442 60 54M = 241356612 224 CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC
      IL31_4368:1:1:997:15249 77 * 0 0 * * 0 0 NCGTTATAATGGAATTATTTTTCTTCCTTTATTTAATGTGTTGACAAAGAGAAC (916928.82@@054;33222224;@2?22;5=;;858*0666
      IL31_4368:1:1:997:15249 141 * 0 0 * * 0 0 AATGTTCTGAAACCTCTGAGAAAGCAAATATTTATTTTAATGAAAAATCCTTAT EDEEC;EEE;EEE?EECE;7AEEEEEE07EECEA;D6D+EE4E7EEE4;E=EA
      IL31_4368:1:1:997:6273 77 * 0 0 * * 0 0 NTACGAAGAAGTATTTCATTGGGAGGAGCTTATCCAAATATTTCCTGTCTATCC (**4*5-*3299+::@2;;853+39;0.3)-)79)..'5.988*200
      IL31_4368:1:1:997:6273 141 * 0 0 * * 0 0 ACATTTACCAAGACCAAAGGAAACTTACCTTGCAAGAATTAGACAGTTCATTTG EEAAFFFEEFEFCFAFFAFCCFFEFEFEFFFFB?ABA@ECEE=F@DE@DDF;
      IL31_4368:1:1:997:1657 83 chr1 143630364 60 54M = 143630066 -352 TACCTTTTTAAAGAGATCTAAAATTGTCACATGGTTATTAGATACAGAGGCCTN
      IL31_4368:1:1:997:1657 163 chr1 143630066 60 54M = 143630364 352 CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT
      IL31_4368:1:1:997:5609 77 * 0 0 * * 0 0 NGGTGTCTCTTACGGACAGCATTAAGCTAGATTCTTTTTAGACCGATCTGCCAA (*+*,1426;@@??@?9@@@@4?666260.)-*9;;;8:'0418
      IL31_4368:1:1:997:5609 141 * 0 0 * * 0 0 TCACTATCAGAAACAGAATGTATAACTTCCAAATCAGTAGGAAACACAAGGAAA AEECECBEC@A;AC=AEEEEAEEEEAC,CE?ECCE9EAEC4E:CAC@EE)
      IL31_4368:1:1:997:14262 77 * 0 0 * * 0 0 NGAGAACCAATGGGAAGCAGCCTGAGCTGCTGGAACCTATTCCCCATGACTTCA (9136242-2@@@;96.@@@@0$2623.':**+3*03137..--.
      IL31_4368:1:1:997:14262 141 * 0 0 * * 0 0 TGTTTTTTCTTTTTCTTTTTTTTTTGACAGTGCAGAGATTTTTTATCTTTTTAA 97'2.64.?7/3(891?=(6??6+6++/*..3(:'/'9::''(1.(,
      IL31_4368:1:1:998:19914 77 * 0 0 * * 0 0 NAGAGCATTGACACACATAAAAAATTAAAACAACCCTTTGTACTTACGGTAGAA (/89255@?7..())@@@;2265267@@8..3;/$
      IL31_4368:1:1:998:19914 141 * 0 0 * * 0 0 GAATGAAAGCAGAGACCCTGATCGAGCCCCAGAAAGATACACCTCCAGATTTTA C?=CECE4CD?8@==;EBE=0@:@@92@???6991.?A=@5?@99;971
  • 30. SAM Example: sorted SAM. One row is one read, NOT one fragment.
      IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21 58 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
      IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17 -58 ACCCTACCCCTAGCCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA
      IL31_4368:1:10:17817:9758 137 chr1 23 0 54M = 23 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCAGATC
      IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44 61 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
      IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37 -61 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
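Since one row is one read, the two reads of a fragment appear as separate rows sharing the same QNAME (first column). A minimal Python sketch of regrouping rows by fragment follows; `group_by_fragment` is a hypothetical helper, and the demo rows are abbreviated from the slide, with `*` standing in for SEQ and QUAL.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_fragment(sam_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group alignment rows by QNAME, skipping '@' header lines."""
    fragments = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not a read
        qname = line.split("\t", 1)[0]
        fragments[qname].append(line.rstrip("\n"))
    return dict(fragments)


# Rows abbreviated from the slide: SEQ and QUAL replaced by '*'.
rows = [
    "@HD\tVN:1.0\tSO:coordinate",
    "IL31_4368:1:107:15207:19097\t163\tchr1\t17\t0\t54M\t=\t21\t58\t*\t*",
    "IL31_4368:1:107:15207:19097\t83\tchr1\t21\t0\t54M\t=\t17\t-58\t*\t*",
    "IL31_4368:1:10:17817:9758\t137\tchr1\t23\t0\t54M\t=\t23\t0\t*\t*",
]
fragments = group_by_fragment(rows)
for qname, reads in fragments.items():
    print(qname, len(reads))
```

In a coordinate-sorted file the two rows of a pair can be far apart, so this approach holds every row in memory; it is meant as an illustration, not as something to run on a full alignment file.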
  • 32. Specifications: Record Column
      Col  Field  Type    Brief description
      1    QNAME  String  Query template NAME
      2    FLAG   Int     bitwise FLAG
      3    RNAME  String  Reference sequence NAME
      4    POS    Int     1-based leftmost mapping POSition
      5    MAPQ   Int     MAPping Quality
      6    CIGAR  String  CIGAR string
      7    RNEXT  String  Ref. name of the mate/next read
      8    PNEXT  Int     Position of the mate/next read
      9    TLEN   Int     observed Template LENgth
      10   SEQ    String  segment SEQuence
      11   QUAL   String  ASCII of Phred-scaled base QUALity+33
      12   META           metadata
  • 34. Specifications: Record Column
      Col  Field  Value
      1    QNAME  IL31_4368:1:42:12530:7509
      2    FLAG   137
      3    RNAME  chr1
      4    POS    10
      5    MAPQ   30
      6    CIGAR  54M
      7    RNEXT  =
      8    PNEXT  100
      9    TLEN   90
      10   SEQ    TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGATC
      11   QUAL   GGGGGGGFEGGGGCFGGGGGEGGFGEGGFGFGGFGFEGFCFFBECCBDACB@?B
      12   META   XT:A:R NM:i:3 SM:i:0 AM:i:0 X0:i:11 X1:i:0 XM:i:3 XO:i:0 XG:i:0
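The eleven mandatory columns plus the optional metadata fields map naturally onto a small record type. The following Python sketch assumes tab-separated input, as in a real SAM file; `SamRecord` and `parse_sam_record` are hypothetical names, and the optional fields are kept as raw `TAG:TYPE:VALUE` strings.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int          # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]   # optional metadata, e.g. 'NM:i:1'


def parse_sam_record(line: str) -> SamRecord:
    """Parse one tab-separated SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 columns")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


# Last row of the simple example shown earlier, with tabs restored.
record = parse_sam_record("r001\t83\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1")
print(record.qname, record.pos, record.cigar, record.tags)
```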
  • 35. SAM FLAGS
      - read paired
      - read mapped in proper pair
      - read unmapped
      - mate unmapped
      - read reverse strand
      - mate reverse strand
      - first in pair
      - second in pair
      - not primary alignment
      - read fails platform/vendor quality checks
      - read is PCR or optical duplicate
      - supplementary alignment
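The FLAG column is a bit field: each property in the list above corresponds to one bit, in the order listed, starting at 0x1. A minimal Python sketch of decoding a FLAG value into these descriptions follows; `FLAG_BITS` and `decode_flag` are hypothetical names.

```python
from typing import List

# One (bit, description) pair per flag, in the order of the list above.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the description of every bit that is set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS if flag & bit]


for value in (77, 141, 163, 2064):
    print(value, decode_flag(value))
```

The values 77 and 141, which mark the unmapped pairs in the earlier example, decode to "read paired, read unmapped, mate unmapped" plus "first in pair" and "second in pair" respectively.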
  • 37. SAM FLAGS
  • 38. SAM FLAGS Read Paired
  • 39. SAM FLAGS Read mapped in proper pair
  • 40. Read mapped in proper pair
  • 41. SAM FLAGS Read unmapped
  • 42. SAM FLAGS Mate unmapped
  • 43. SAM FLAGS Read reverse strand
  • 44. SAM FLAGS Mate reverse strand
  • 45. SAM FLAGS First in pair
  • 46. SAM FLAGS Second in pair
  • 47. SAM FLAGS not primary alignment
  • 48. SAM FLAGS read fails platform/vendor quality checks
  • 49. SAM FLAGS read is PCR or optical duplicate
  • 50. SAM FLAGS supplementary alignment
  • 51. SAM CIGAR The CIGAR string is a sequence of base lengths and their associated operations. It indicates which bases align with the reference (as either a match or a mismatch), which are deleted from the reference, and which are insertions that are not in the reference.
  • 52. SAM Cigar
      Op  BAM  Description
      M   0    alignment match (can be a sequence match or mismatch)
      I   1    insertion to the reference
      D   2    deletion from the reference
      N   3    skipped region from the reference
      S   4    soft clipping (clipped sequences present in SEQ)
      H   5    hard clipping (clipped sequences NOT present in SEQ)
      P   6    padding (silent deletion from padded reference)
      =   7    sequence match
      X   8    sequence mismatch
  • 53. SAM Cigar (http://genome.sph.umich.edu/wiki/SAM)
      RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
      Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
      Read: ACTAGAATGGCT

      Aligning these two:
      RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
      Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
      Read:                   A  C  T  A  G  A  A     T  G  G  C  T

      With the alignment above, you get:
      POS: 5
      CIGAR: 3M1I3M1D5M
      or
      CIGAR: 3=1I3=1D2=1X2=
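To make the arithmetic of this example explicit, here is a minimal sketch that sums CIGAR operations. It relies on the fact that M, D, N, = and X consume reference bases while M, I, S, = and X consume read bases; the class `CigarSketch` is a hypothetical name, and malformed CIGAR strings are not rejected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only: computes lengths implied by a CIGAR string. */
final class CigarSketch {
    // One CIGAR element: a length followed by a single operation character.
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    /** Number of reference bases covered by the alignment. */
    static int referenceLength(String cigar) {
        return sum(cigar, "MDN=X");
    }

    /** Number of bases expected in the SEQ column. */
    static int readLength(String cigar) {
        return sum(cigar, "MIS=X");
    }

    /** Adds up the lengths of all elements whose operation is listed in ops. */
    private static int sum(String cigar, String ops) {
        int total = 0;
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            if (ops.indexOf(m.group(2).charAt(0)) >= 0) {
                total += Integer.parseInt(m.group(1));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int pos = 5;                    // 1-based leftmost position from the example
        String cigar = "3M1I3M1D5M";
        int end = pos + referenceLength(cigar) - 1;
        System.out.println("read length = " + readLength(cigar));
        System.out.println("alignment spans reference " + pos + ".." + end);
    }
}
```

Both spellings of the example CIGAR give a read length of 12 (the length of ACTAGAATGGCT) and cover 12 reference bases, so the alignment runs from position 5 to position 16.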
  • 54. SAM Cigar Soft Clip
  • 55. SAM Cigar Hard Clip
  • 56. SAM Format optional TAGs: optional fields on a SAM/BAM alignment. A TAG is comprised of a two-character TAG key, the type of the value, and the value:
      [A-Za-z][A-Za-z0-9]:[AifZH]:.*
      The types A, i, f, Z and H indicate the type of value stored in the tag.
      Type  Description
      A     character
      i     signed 32-bit integer
      f     single-precision float
      Z     string
      H     hex string
  • 58. SAM Format optional TAGs
      XT:A:U - user-defined tag called XT. It holds a character. The value associated with this tag is 'U'.
      NM:i:2 - predefined tag NM means: edit distance to the reference (number of changes necessary to make this equal the reference, excluding clipping).
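The KEY:TYPE:VALUE layout of these two tags can be split with a few lines of Java. This is a sketch under simple assumptions (no validation of the key characters or of the type letter); `SamTagSketch` is a hypothetical class written for this illustration.

```java
/** Illustrative sketch only: splits an optional SAM field such as "NM:i:2". */
final class SamTagSketch {
    final String key;   // two-character TAG key, e.g. "NM"
    final char type;    // one of A, i, f, Z, H
    final String value; // raw text of the value

    private SamTagSketch(String key, char type, String value) {
        this.key = key;
        this.type = type;
        this.value = value;
    }

    /** Parses KEY:TYPE:VALUE; the split limit of 3 keeps ':' inside string values. */
    static SamTagSketch parse(String field) {
        String[] parts = field.split(":", 3);
        if (parts.length != 3 || parts[0].length() != 2 || parts[1].length() != 1) {
            throw new IllegalArgumentException("Not a TAG:TYPE:VALUE field: " + field);
        }
        return new SamTagSketch(parts[0], parts[1].charAt(0), parts[2]);
    }

    /** Returns the value as an int; only meaningful for type 'i'. */
    int asInt() {
        if (type != 'i') {
            throw new IllegalStateException(key + " is not an integer tag");
        }
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        SamTagSketch xt = parse("XT:A:U");
        SamTagSketch nm = parse("NM:i:2");
        System.out.println(xt.key + " holds character " + xt.value);
        System.out.println(nm.key + " edit distance = " + nm.asInt());
    }
}
```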
  • 61. SAM Example: sorted SAM
      IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21 58 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
      IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17 -58 ACCCTACCCCTAGCCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA
      IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44 61 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
      IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37 -61 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
  • 62. BAM
  • 63. BGZF Format The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text format (SAM) and a compressed binary format (BAM). The latter uses a modified form of gzip compression called BGZF (Blocked GNU Zip Format), which can be applied to any file format to provide compression with efficient random access.
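The idea behind blocked compression can be illustrated without any BAM library. The sketch below is not BGZF itself: it simply compresses each chunk as an independent gzip member with `java.util.zip`, remembers where each member starts, and later decompresses one member without touching the others. Real BGZF additionally records each block's size in a gzip extra field; the class `BlockedGzipSketch` and the in-memory "file" are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative sketch only: independent gzip blocks allow random access. */
final class BlockedGzipSketch {
    /** Compresses each chunk as its own gzip member; offsets[i] receives the start of block i. */
    static byte[] compressBlocks(String[] chunks, int[] offsets) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        try {
            for (int i = 0; i < chunks.length; i++) {
                offsets[i] = file.size();
                GZIPOutputStream gz = new GZIPOutputStream(file);
                gz.write(chunks[i].getBytes(StandardCharsets.UTF_8));
                gz.close(); // writes the gzip trailer; closing the in-memory stream is harmless
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file.toByteArray();
    }

    /** Decompresses only the bytes in [start, end), i.e. one or more whole blocks. */
    static String readBlock(byte[] file, int start, int end) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(file, start, end - start))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] chunks = {"records for chr1\n", "records for chr2\n", "records for chr3\n"};
        int[] offsets = new int[chunks.length];
        byte[] file = compressBlocks(chunks, offsets);
        // Jump straight to the second block without decompressing the first.
        System.out.print(readBlock(file, offsets[1], offsets[2]));
    }
}
```

Because every block is a complete gzip member, the concatenation is still readable as ordinary gzip from start to end, while an index of block offsets lets a reader seek directly to the region it needs.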
  • 67. BAM INDEX
  • 68. CRAM
  • 69. VCF Variant Call Format
  • 70. VCF Format VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome.
  • 72. VCF Example
      ##fileformat=VCFv4.0
      ##fileDate=20090805
      ##source=myImputationProgramV3.1
      ##reference=1000GenomesPilot-NCBI36
      ##phasing=partial
      ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
      ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
      ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
      ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
      ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
      ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
      ##FILTER=<ID=q10,Description="Quality below 10">
      ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
      ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
      ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
      20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 2/2:35:4
      20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
      20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
  • 73. VCF Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE-1 SAMPLE-2 SAMPLE-3 ... (...)
  • 75. INFO fields should be described as follows:
      ##INFO=<ID=ID,Number=number,Type=type,Description="description">
      (...)
      ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
      (...)
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
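As a rough illustration of how the INFO column is laid out, the sketch below splits it on ';' and then on the first '='. Keys of type Flag, such as DB and H2, carry no value and are stored with an empty string. The class `VcfInfoSketch` is a hypothetical name; types declared in the ##INFO header lines are not checked here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only: splits the INFO column of a VCF data line. */
final class VcfInfoSketch {
    /** Maps each INFO key to its raw value; Flag keys (no '=') map to "". */
    static Map<String, String> parse(String info) {
        Map<String, String> map = new LinkedHashMap<>();
        if (info.equals(".")) {
            return map; // '.' means the INFO column is empty
        }
        for (String item : info.split(";")) {
            int eq = item.indexOf('=');
            if (eq < 0) {
                map.put(item, "");
            } else {
                map.put(item.substring(0, eq), item.substring(eq + 1));
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> info = parse("NS=3;DP=14;AF=0.5;DB;H2");
        System.out.println("samples with data: " + info.get("NS"));
        System.out.println("in dbSNP: " + info.containsKey("DB"));
    }
}
```

Values with several entries, such as AF=0.333,0.667, stay as one string here and can be split on ',' by the caller.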
  • 76. VCF FILTERs: FILTERs that have been applied to the data should be described as follows:
      ##FILTER=<ID=ID,Description="description">
      (...)
      ##FILTER=<ID=q10,Description="Quality below 10">
      ##FILTER=<ID=s50,Description="Less than 50 percent of samples have data">
      (...)
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
      20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 2/2:35:4
  • 79. Genotype fields specified in the FORMAT field should be described as follows:
      ##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
      (...)
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      (...)
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
      20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 2/2:35:4
      20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
      20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
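The FORMAT column names the keys, and each sample column lists the values in the same order, separated by ':'. The sketch below pairs them up and reads the GT value, where '|' separates phased alleles and '/' unphased ones. `VcfGenotypeSketch` is a hypothetical class; values dropped from the end of a sample column (as in 0/0:41:3) are reported as ".".

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only: pairs FORMAT keys with one sample's values. */
final class VcfGenotypeSketch {
    /** Returns key -> value for one sample; missing trailing values become ".". */
    static Map<String, String> parse(String format, String sample) {
        String[] keys = format.split(":");
        String[] values = sample.split(":");
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], i < values.length ? values[i] : ".");
        }
        return map;
    }

    /** True when the genotype uses '|' (phased) rather than '/' (unphased). */
    static boolean isPhased(String gt) {
        return gt.indexOf('|') >= 0;
    }

    /** Allele indexes of a GT value: 0 is REF, 1 and up index ALT; -1 means missing. */
    static int[] alleles(String gt) {
        String[] parts = gt.split("[|/]");
        int[] result = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            result[i] = parts[i].equals(".") ? -1 : Integer.parseInt(parts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> sample = parse("GT:GQ:DP:HQ", "1|0:48:8:51,51");
        String gt = sample.get("GT");
        System.out.println("GT=" + gt + " phased=" + isPhased(gt) + " GQ=" + sample.get("GQ"));
    }
}
```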
  • 81. Tabix
  • 82. Binning
  • 83. Tabix INDEX
  • 84. Building the TABIX index
      $ bgzip -f file.vcf
      $ tabix -p vcf file.vcf.gz
  • 85. Querying the TABIX index
      $ tabix file.vcf.gz chr3:1235456778
  • 86. API
  • 87. Reading SAM with the samtools C library
      #include <stdlib.h>
      #include <stdio.h>
      #include "bam.h"
      #include "sam.h"

      /* Count the mapped reads of a file (legacy samtools 0.1.x API). */
      int main(int argc, char *argv[]) {
          if (argc < 2) {
              fprintf(stderr, "Usage: %s file.bam\n", argv[0]);
              return EXIT_FAILURE;
          }
          /* "rb" = read binary (BAM) */
          samfile_t *sam = samopen(argv[1], "rb", 0);
          if (sam == NULL) {
              fprintf(stderr, "Cannot open %s\n", argv[1]);
              return EXIT_FAILURE;
          }
          bam1_t *b = bam_init1();
          long n = 0L;
          /* samread() returns a negative value at end of file */
          while (samread(sam, b) >= 0) {
              /* keep the read unless the "read unmapped" bit is set */
              if (!(b->core.flag & BAM_FUNMAP)) ++n;
          }
          bam_destroy1(b);
          samclose(sam);
          printf("%ld\n", n);
          return 0;
      }
  • 88. Reading SAM with the java picard library
      import java.io.File;
      import net.sf.samtools.*;

      /* Count the mapped reads of a SAM/BAM file (legacy net.sf.samtools package). */
      public class CountMapped {
          public static void main(String[] args) {
              long n = 0L;
              File f = new File(args[0]);
              SAMFileReader sam = new SAMFileReader(f);
              for (SAMRecord rec : sam) {
                  // the accessor for the "read unmapped" bit is getReadUnmappedFlag()
                  if (!rec.getReadUnmappedFlag()) {
                      ++n;
                  }
              }
              sam.close();
              System.out.println(n);
          }
      }
  • 89. End
  • 90. Credits
      Angus: http://ged.msu.edu/angus/
      Wikipedia: https://en.wikibooks.org/wiki/C%2B%2B_Programming/Programming_Languages/C%2B%2B/Code/Statements/Variables
      Abecasis Group Wiki: http://genome.sph.umich.edu/wiki/SAM
      Genome Research: http://genome.cshlp.org/content/12/6/996