Next Generation Sequencing 
File Formats. 
Pierre Lindenbaum 
@yokofakun 
pierre.lindenbaum@univ-nantes.fr 
http://plindenbaum.blogspot.com 
https://github.com/lindenb/courses 
Institut du Thorax. Nantes. France 
September 19, 2014 
Pierre Lindenbaum @yokofakun pierre.lindenbaum@univ-nantes.fr http://plindenbaum.blogspot.com Next Generation Sequencing File Formats. https://github.com/lindenb/courses
You don't need to have a deep knowledge of those formats. 
(Unless you're doing NGS) 
Understand how people have solved their BIG data problems. 
Why sequencing ? 
Well, that's a little more complicated ... 
FASTQ 
FASTQ 
FASTQ: text-based format for storing both a DNA sequence and 
its corresponding quality scores 
FASTQ 
FASTQ for single end 
FASTQ 
FASTQ for paired end 
FASTQ Example 
@IL31_4368:1:1:996:8507/1 
NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA 
+ 
(94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.% 
@IL31_4368:1:1:996:21421/1 
NAAGTTAATTCTTCATTGTCCATTCCTCTGAAATGATTCAGAAATACTGGTAGT 
+ 
(**+*2396,@<+<:@@@;;5)<0)69606>4;5>;>6&<102)0*+8:&137; 
@IL31_4368:1:1:997:10572/1 
NAATGTATGTAGACCCTTCACATTCAAAGGCAAATACAATATCATCATGTCTTC 
+ 
(/9**-0032>:>>9>4@@=>??@@:-66,;>;<;6+;255,1;7>>>>3676' 
@IL31_4368:1:1:997:15684/1 
NGCAATCAATGCTATGATTGATCCTGATGGAACTTTGGAGGCTCTGAACAACAT 
+ 
()1,*37766>@@@>?@<?@@:>@0>>><-888>8;>*;966>;;;@8@4,.2. 
@IL31_4368:1:1:997:15249/1 
NCGTTATAATGGAATTATTTTTCTTCCTTTATTTAATGTGTTGACAAAGAGAAC 
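A minimal sketch of reading records like the ones above, assuming the common layout of exactly four lines per record (name, sequence, "+", quality). The names `FastqRecord` and `read_fastq` are illustrative and do not come from any library; FASTQ files with wrapped sequence lines are not handled.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


# First record of the example above.
example = io.StringIO(
    "@IL31_4368:1:1:996:8507/1\n"
    "NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA\n"
    "+\n"
    "(94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.%\n"
)
records = list(read_fastq(example))
print(records[0].name, len(records[0].sequence))
```

Because the function is a generator, it works on a stream and never holds more than one record in memory.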
FASTQ name
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
Col      Brief description
EAS139   the unique instrument name
136      the run id
FC706VJ  the flowcell id
2        flowcell lane
2104     tile number within the flowcell lane
15343    'x'-coordinate of the cluster within the tile
197393   'y'-coordinate of the cluster within the tile
1        the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y        Y if the read fails filter (read is bad), N otherwise
18       0 when none of the control bits are on, otherwise it is an even number
ATCACG   index sequence
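The fields of the name line can be pulled apart as sketched below. This assumes the exact layout of the example (seven colon-separated fields, a space, then four more); `IlluminaReadName` and `parse_read_name` are hypothetical names chosen for illustration, and older name formats such as `IL31_4368:1:1:996:8507/1` would need a different parser.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    fails_filter: bool
    control_bits: int
    index_sequence: str


def parse_read_name(line: str) -> IlluminaReadName:
    """Split an Illumina-style FASTQ name line into its fields."""
    location, description = line.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    member, filtered, control, index = description.strip().split(":")
    return IlluminaReadName(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(member), filtered == "Y", int(control), index,
    )


name = parse_read_name("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(name.flowcell_id, name.tile, name.fails_filter)
```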
FASTQ Quality 
FASTQ Quality 
A quality value Q is an integer mapping of p (i.e., the probability 
that the corresponding base call is incorrect). 
$Q_{\mathrm{sanger}} = -10 \log_{10} p$
FASTQ Quality 
Since a human-readable format is desired for SAM, 33 is added to
the calculated quality in order to make it a printable character
ranging from ! to ~.
$\text{ASCII code} = Q_{\mathrm{sanger}} + 33 = -10 \log_{10} p + 33$
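The conversion between error probability, Phred score and printable character can be sketched as follows. The function names are illustrative, and the offset of 33 is the Sanger encoding described above; other offsets exist in older files and are not covered here.

```python
import math


def phred_from_probability(p: float) -> int:
    """Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def probability_from_phred(q: int) -> float:
    """Inverse mapping: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Turn a Sanger-encoded quality string into Phred scores."""
    return [ord(char) - offset for char in quality]


def encode_quality(scores: list[int], offset: int = 33) -> str:
    """Turn Phred scores back into a printable quality string."""
    return "".join(chr(q + offset) for q in scores)


print(phred_from_probability(0.001))  # one error in a thousand base calls
print(decode_quality("(94**"))        # first five qualities of the FASTQ example
```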
Aligned Reads 
44187101 44187111 44187121 44187131 44187141 44187151 44187161 44187171 
aaatgagccaggtgtggtggtgcacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc REFERENCE 
............................................Y................................... CONSENSUS 
aaa gagccaggtgtggtggtgcacaccgataggcccagctacgtaggaggctgaggtgggaggatcgcttaaa cggc 
AAA GAGCCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAA CGGC 
aaatga CCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCC c 
aaatgagcc GGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC 
AAATGAGCCAGG gtggtggtgcacacctatagtcccagcgacgtaggaggctgaggtgggaggatcgcttaaacccggc 
AAATGAGCCAGGTG ggtggtgcacacctatagtcccagctaagtaggaggctgaggtgggaggatcgctttaacccggc 
AAATGAGCCAGGTGT GTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCGGGC 
ACATGAGCCAGGTGTG tggtgcacacctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc 
aaatgagccaggtgtgg GCACACGTAAAGTCCCAGCTACGCAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC 
CAATGAGCCAGTTGTGG cacacctatagtcccagctacgcacgaggctgaggtgggaggatcgctttaacccggc 
AAATGAGCCAGGTGAGGT cacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc 
AAATGAGCCAGGTGTGGT acacctatagtcccagctacgcaggaggctgaggtgggaggatcgctttaacccggc 
aaatgagccaggtgtggtgg cctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc 
AAATGAGCCAGGTGTGGTGG TATAGTCCCAGCTACGCAGGAGGCTGAGGTGGTAGGATCGCATAAACCCGGC 
AAATGAGCCAGGTGTGGTGGT TAGTCCCAGCTACGTAGGAGGCTGAGTTGGGAGGATCTCTTAAACCCGGC 
aaatgagccaggtgtggtggtg TCGTCCCAGCTACGCAGGAGGCTTAGGTGGGAGGATCGCTTAAACCCGGC 
aaatgagccaggtgtggtggtgca AGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGGTTAAACCCGGC 
aaatgagccaggtgtggtggtgcac cccagctacgcaggaggctgaggtgggaccatcgcttaaaccccgc 
aaatgagccaggtgtggtggtgcac CCAGCTACGTAGTAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC 
SAM 
SAM Format 
SAM (Sequence Alignment/Map) format is a generic format for 
storing large nucleotide sequence alignments 
SAM Format 
Is flexible enough to store all the alignment information
generated by various alignment programs;
Is simple enough to be easily generated by alignment
programs or converted from existing alignment formats;
Is compact in file size;
Allows most operations on the alignment to work on a
stream without loading the whole alignment into memory;
Allows the file to be indexed by genomic position to efficiently
retrieve all reads aligning to a locus.
SAM Format 
Structure 
+ HEADER 
-version 
-program parameters 
+GENOME 
- chrom1 size 
- chrom2 size 
- chrom3 size 
- (..) 
+GROUPS 
- group1 : sample1, lane 4 
- group2 : sample2, lane 1 
+ BODY 
- READ1 - group1 
- READ2 - group1 
- READ3 - group1 
- READ4 - group2 
- (...)
SAM Example 
Simple example 
@HD VN:1.5 SO:coordinate 
@SQ SN:ref LN:45 
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * 
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * 
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0; 
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * 
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1; 
r001 83 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1 
SAM Header Section 
SAM Header 
@HD VN:1.0 SO:coordinate 
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e 
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5 
@RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743
@RG ID:UM0098:2 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L002 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743
@PG ID:bwa VN:0.5.4
@PG ID:GATK TableRecalibration VN:1.0.3471 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, CycleCovariate,
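A header like the one above can be read into a dictionary keyed by record type, as sketched below. This assumes tab-separated fields, as in a real SAM file (the slide text shows spaces); `parse_sam_header` is an illustrative name, and the shortened example lines keep only some of the tags shown above.

```python
from collections import defaultdict


def parse_sam_header(lines):
    """Group SAM header lines by record type (HD, SQ, RG, PG, ...).

    Each line becomes a dict of TAG -> value.
    """
    header = defaultdict(list)
    for line in lines:
        if not line.startswith("@"):
            break  # first alignment line: the header is over
        record_type, *fields = line.rstrip("\n").split("\t")
        if record_type == "@CO":
            header["CO"].append("\t".join(fields))  # free-text comment
            continue
        # Split on the first ':' only, values may contain ':' themselves.
        header[record_type[1:]].append(dict(f.split(":", 1) for f in fields))
    return dict(header)


example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621\tAS:NCBI37",
    "@SQ\tSN:2\tLN:243199373\tAS:NCBI37",
    "@RG\tID:UM0098:1\tPL:ILLUMINA\tSM:SD37743",
]
header = parse_sam_header(example)
chrom_sizes = {sq["SN"]: int(sq["LN"]) for sq in header["SQ"]}
print(chrom_sizes)
```

The `@SQ` lines give the chromosome sizes, and the `@RG` lines link each read group to a sample.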
SAM Alignment Section 
SAM Example 
Simple example 
IL31_4368:1:1:996:8507 77 * 0 0 * * 0 0 NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA (94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.%
IL31_4368:1:1:996:8507 141 * 0 0 * * 0 0 TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA FFCEFFFEEFFFFFFFEFFEFFFEFCFCEEFEFFFCEFF;EEFF=FEE?FCE 
IL31_4368:1:1:996:21421 77 * 0 0 * * 0 0 NAAGTTAATTCTTCATTGTCCATTCCTCTGAAATGATTCAGAAATACTGGTAGT (**+*2396,@<+<:@@@;;5)<0)69606>4;5>;>6&<102)0*+8:&137;
IL31_4368:1:1:996:21421 141 * 0 0 * * 0 0 CAAAAACTTTCACTTTACCTGCCGGGTTTCCCAGTTTACATTCCACTGTTTGAC DBDDB,B9BAA4AAB7BB?7BBB=91;+*@;587+*=/*@@?9=73=.7)7* 
IL31_4368:1:1:997:10572 77 * 0 0 * * 0 0 NAATGTATGTAGACCCTTCACATTCAAAGGCAAATACAATATCATCATGTCTTC (/9**-0032>:>>9>4@@=>??@@:-66,;>;<;6+;255,1;7>>>>3676'
IL31_4368:1:1:997:10572 141 * 0 0 * * 0 0 GATCTTCTGTGACTGGAAGAAAATGTGTTACATATTACATTTCTGTCCCCATTG E?=EECEEEEE98EEEEAEEBD??BE@AEABEEABCEEDECEBDA=DEE 
IL31_4368:1:1:997:15684 83 chr1 241356612 60 54M = 241356442 -224 ATGTTGTTCAGAGCCTCCAAAGTTCCATCAGGATCAATCATAGCATTGATTGCN
IL31_4368:1:1:997:15684 163 chr1 241356442 60 54M = 241356612 224 CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC
IL31_4368:1:1:997:15249 77 * 0 0 * * 0 0 NCGTTATAATGGAATTATTTTTCTTCCTTTATTTAATGTGTTGACAAAGAGAAC (916928.82@@054;33222224;@2?22;5=;;858*0666
IL31_4368:1:1:997:15249 141 * 0 0 * * 0 0 AATGTTCTGAAACCTCTGAGAAAGCAAATATTTATTTTAATGAAAAATCCTTAT EDEEC;EEE;EEE?EECE;7AEEEEEE07EECEA;D6D+EE4E7EEE4;E=EA 
IL31_4368:1:1:997:6273 77 * 0 0 * * 0 0 NTACGAAGAAGTATTTCATTGGGAGGAGCTTATCCAAATATTTCCTGTCTATCC (**4*5-*3299+::@2;;853+39;0.3)-)79)..'5.988*200 
IL31_4368:1:1:997:6273 141 * 0 0 * * 0 0 ACATTTACCAAGACCAAAGGAAACTTACCTTGCAAGAATTAGACAGTTCATTTG EEAAFFFEEFEFCFAFFAFCCFFEFEFEFFFFB?ABA@ECEE=F@DE@DDF; 
IL31_4368:1:1:997:1657 83 chr1 143630364 60 54M = 143630066 -352 TACCTTTTTAAAGAGATCTAAAATTGTCACATGGTTATTAGATACAGAGGCCTN
IL31_4368:1:1:997:1657 163 chr1 143630066 60 54M = 143630364 352 CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT
IL31_4368:1:1:997:5609 77 * 0 0 * * 0 0 NGGTGTCTCTTACGGACAGCATTAAGCTAGATTCTTTTTAGACCGATCTGCCAA (*+*,1426;@@??@?9@@@@4?666260.)-*9;;;8:'0418
IL31_4368:1:1:997:5609 141 * 0 0 * * 0 0 TCACTATCAGAAACAGAATGTATAACTTCCAAATCAGTAGGAAACACAAGGAAA AEECECBEC@A;AC=AEEEEAEEEEAC,CE?ECCE9EAEC4E:CAC@EE) 
IL31_4368:1:1:997:14262 77 * 0 0 * * 0 0 NGAGAACCAATGGGAAGCAGCCTGAGCTGCTGGAACCTATTCCCCATGACTTCA (9136242-2@@@;96.@@@@0$2623.':**+3*03137..--. 
IL31_4368:1:1:997:14262 141 * 0 0 * * 0 0 TGTTTTTTCTTTTTCTTTTTTTTTTGACAGTGCAGAGATTTTTTATCTTTTTAA 97'2.64.?7/3(891?=(6??6+6++/*..3(:'/'9::''(1.(, 
IL31_4368:1:1:998:19914 77 * 0 0 * * 0 0 NAGAGCATTGACACACATAAAAAATTAAAACAACCCTTTGTACTTACGGTAGAA (/89255@?7..())@@@;2265267@@8..3;/$ 
IL31_4368:1:1:998:19914 141 * 0 0 * * 0 0 GAATGAAAGCAGAGACCCTGATCGAGCCCCAGAAAGATACACCTCCAGATTTTA C?=CECE4CD?8@==;EBE=0@:@@92@???6991.?A=@5?@99;971 
SAM Example 
Sorted SAM 
One row is one read, NOT one fragment. 
IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21 58 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17 -58 ACCCTACCCCTAGCCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA
IL31_4368:1:10:17817:9758 137 chr1 23 0 54M = 23 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCAGATC
IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44 61 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37 -61 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
SAM Specifications
Record Column
Col  Field  Type    Brief description
1    QNAME  String  Query template NAME
2    FLAG   Int     bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality
6    CIGAR  String  CIGAR string
7    RNEXT  String  Ref. name of the mate/next read
8    PNEXT  Int     Position of the mate/next read
9    TLEN   Int     observed Template LENgth
10   SEQ    String  segment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33
12   META           metadata
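The column table above maps directly onto a small record type. The sketch below assumes tab-separated columns, as in a real SAM file; `SamRecord` and `parse_sam_line` are illustrative names, and the optional fields (column 12 onwards) are kept as raw strings.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: list


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into the 11 mandatory columns
    plus the list of optional fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]), cols[5],
        cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10], cols[11:],
    )


line = "r001\t83\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, record.tags)
```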
SAM Specifications
Record Column
Col  Field  Value
1    QNAME  IL31_4368:1:42:12530:7509
2    FLAG   137
3    RNAME  chr1
4    POS    10
5    MAPQ   30
6    CIGAR  54M
7    RNEXT  =
8    PNEXT  100
9    TLEN   90
10   SEQ    TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGATC
11   QUAL   GGGGGGGFEGGGGCFGGGGGEGGFGEGGFGFGGFGFEGFCFFBECCBDACB@?B
12   META   XT:A:R NM:i:3 SM:i:0 AM:i:0 X0:i:11 X1:i:0 XM:i:3 XO:i:0 XG:i:0
SAM FLAGS
0x1    read paired.
0x2    read mapped in proper pair.
0x4    read unmapped.
0x8    mate unmapped.
0x10   read reverse strand.
0x20   mate reverse strand.
0x40   first in pair.
0x80   second in pair.
0x100  not primary alignment.
0x200  read fails platform/vendor quality checks.
0x400  read is PCR or optical duplicate.
0x800  supplementary alignment.
SAM FLAGS
Read paired
SAM FLAGS
Read mapped in proper pair
SAM FLAGS
Read unmapped
SAM FLAGS
Mate unmapped
SAM FLAGS
Read reverse strand
SAM FLAGS
Mate reverse strand
SAM FLAGS
First in pair
SAM FLAGS
Second in pair
SAM FLAGS
not primary alignment
SAM FLAGS
read fails platform/vendor quality checks
SAM FLAGS
read is PCR or optical duplicate
SAM FLAGS
supplementary alignment
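The FLAG column is the sum of the bits that apply to a read, so decoding it is a matter of testing each bit. A minimal sketch, using the bit values from the SAM specification (0x1 for "read paired" up to 0x800 for "supplementary alignment"); the names `SamFlag` and `describe_flag` are illustrative.

```python
import enum


class SamFlag(enum.IntFlag):
    PAIRED = 0x1
    PROPER_PAIR = 0x2
    UNMAPPED = 0x4
    MATE_UNMAPPED = 0x8
    REVERSE = 0x10
    MATE_REVERSE = 0x20
    FIRST_IN_PAIR = 0x40
    SECOND_IN_PAIR = 0x80
    NOT_PRIMARY = 0x100
    FAILS_QC = 0x200
    DUPLICATE = 0x400
    SUPPLEMENTARY = 0x800


def describe_flag(value: int) -> list[str]:
    """Return the names of all bits set in a SAM FLAG value."""
    return [flag.name for flag in SamFlag if value & flag]


# FLAG values taken from the SAM examples in these slides.
for value in (163, 83, 77, 2064):
    print(value, describe_flag(value))
```

For instance 163 = 1 + 2 + 32 + 128: the read is paired, mapped in a proper pair, its mate is on the reverse strand, and it is the second read of the pair.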
SAM CIGAR 
The CIGAR string is a sequence of base lengths and their
associated operations. It indicates which bases align with the
reference (as either a match or a mismatch), which are
deleted from the reference, and which are insertions that are not in the
reference.
SAM Cigar 
Op BAM Description 
M 0 alignment match (can be a sequence match or mismatch) 
I 1 insertion to the reference 
D 2 deletion from the reference 
N 3 skipped region from the reference 
S 4 soft clipping (clipped sequences present in SEQ) 
H 5 hard clipping (clipped sequences NOT present in SEQ) 
P 6 padding (silent deletion from padded reference) 
= 7 sequence match 
X 8 sequence mismatch 
SAM Cigar 
http://genome.sph.umich.edu/wiki/SAM 
RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
Read: ACTAGAATGGCT
Aligning these two:
RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T
With the alignment above, you get: 
POS: 5 
CIGAR: 3M1I3M1D5M 
or 
CIGAR: 3=1I3=1D2=1X2= 
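A CIGAR string can be split into (length, operation) pairs with a regular expression; from those pairs one can compute how many reference bases and how many read bases the alignment covers. This is a sketch with illustrative function names; which operations consume the reference or the read follows the operation table of the SAM specification, and the placeholder CIGAR `*` of unmapped reads is rejected rather than handled.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_end(pos: int, cigar: str) -> int:
    """1-based position of the last reference base covered by the read."""
    span = sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)
    return pos + span - 1


def read_length(cigar: str) -> int:
    """Number of bases expected in SEQ for this CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


print(parse_cigar("3M1I3M1D5M"))
print(reference_end(5, "3M1I3M1D5M"), read_length("3M1I3M1D5M"))
```

With POS 5, the example read covers reference positions 5 to 16, and the 12 read bases match the length of ACTAGAATGGCT.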
SAM Cigar 
Soft Clip 
SAM Cigar 
Hard Clip 
SAM Format
optional TAGs
Optional fields on a SAM/BAM alignment. A TAG is composed of
a two-character TAG key, the type of the value, and the value:
[A-Za-z][A-Za-z0-9]:[AifZH]:.*
The types A, i, f, Z and H indicate the type of value
stored in the tag.
Type  Description
A     character
i     signed 32-bit integer
f     single-precision float
Z     string
H     hex string
SAM Format
optional TAGs
XT:A:U - user-defined tag called XT. It holds a character.
The value associated with this tag is 'U'.
NM:i:2 - predefined tag NM means: edit distance to the
reference (number of changes necessary to make this equal
the reference, excluding clipping)
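Optional fields can be converted to typed values by splitting on the first two colons. The sketch below covers only the five types listed in these slides (A, i, f, Z, H); `parse_tag` is an illustrative name, and the array type B defined in the SAM specification is deliberately left out.

```python
def parse_tag(field: str):
    """Parse one optional SAM field 'TAG:TYPE:VALUE' into (tag, value)."""
    # maxsplit=2 keeps any ':' inside the value intact.
    tag, value_type, raw = field.split(":", 2)
    if value_type == "i":
        return tag, int(raw)
    if value_type == "f":
        return tag, float(raw)
    if value_type == "H":
        return tag, bytes.fromhex(raw)
    if value_type in ("A", "Z"):
        return tag, raw
    raise ValueError(f"unsupported tag type {value_type!r} in {field!r}")


fields = "XT:A:R NM:i:3 X0:i:11 SA:Z:ref,29,-,6H5M,17,0;".split()
tags = dict(parse_tag(f) for f in fields)
print(tags)
```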
SAM Example 
Sorted SAM 
IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21 58 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17 -58 ACCCTACCCCTAGCCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA
IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44 61 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37 -61 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
BAM 
BGZF Format 
The SAM/BAM file format (Sequence Alignment/Map) comes in a
plain text format (SAM), and a compressed binary format (BAM).
The latter uses a modified form of gzip compression called BGZF
(Blocked GNU Zip Format), which can be applied to any file
format to provide compression with efficient random access.
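The idea behind blocked compression can be illustrated with plain gzip members. This is a simplified sketch, not the BGZF format itself: real BGZF additionally records each block's size in a gzip extra field and limits block size, whereas here the block offsets are simply kept in a Python list. The names `compress_blocks`, `read_block` and the value of `BLOCK_SIZE` are assumptions made for the example.

```python
import gzip

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (illustrative value)


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress data as independent gzip members.

    Returns the compressed bytes and, for each block, the offset of its
    first compressed byte. Those offsets play the role of an index.
    """
    out = bytearray()
    offsets = []
    for start in range(0, len(data), block_size):
        offsets.append(len(out))
        out += gzip.compress(data[start:start + block_size])
    return bytes(out), offsets


def read_block(compressed: bytes, offsets: list[int], index: int) -> bytes:
    """Decompress a single block without touching the others."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(compressed)
    return gzip.decompress(compressed[start:end])


data = b"".join(b"read%06d\n" % i for i in range(20000))
compressed, offsets = compress_blocks(data)

# The concatenation is still a valid gzip stream ...
assert gzip.decompress(compressed) == data
# ... but a block in the middle can be read on its own.
print(len(offsets), len(read_block(compressed, offsets, 1)))
```

An index that maps genomic positions to block offsets is then enough to jump to the reads of one locus without decompressing the whole file.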
BAM INDEX 
CRAM 
VCF 
Variant Call Format 
VCF Format 
VCF is a text file format (most likely stored in a compressed
manner). It contains meta-information lines, a header line, and
then data lines each containing information about a position in the
genome.
VCF 
Example 
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
VCF 
Column 
CHROM 
POS 
ID 
REF 
ALT 
QUAL 
FILTER 
INFO 
FORMAT 
SAMPLE-1 
SAMPLE-2 
SAMPLE-3 
... 
(...) 
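The columns listed above can be parsed with a few string splits. A minimal sketch, assuming tab-separated columns as in a real VCF file; `parse_info` and `parse_vcf_line` are illustrative names, all values are kept as strings, and the data line is the first variant of the VCF example (sample genotypes written with `|` for phased and `/` for unphased calls).

```python
def parse_info(info: str) -> dict:
    """INFO is a ';'-separated list of KEY=VALUE pairs or bare flags."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # flags have no value
    return result


def parse_vcf_line(line: str, samples: list[str]) -> dict:
    """Parse one VCF data line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]
    record = {
        "CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref,
        "ALT": alt.split(","), "QUAL": qual, "FILTER": flt,
        "INFO": parse_info(info), "SAMPLES": {},
    }
    if len(cols) > 8:
        keys = cols[8].split(":")
        for name, column in zip(samples, cols[9:]):
            # Trailing sample fields may be dropped, zip() simply stops early.
            record["SAMPLES"][name] = dict(zip(keys, column.split(":")))
    return record


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003"
samples = header.split("\t")[9:]
line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\t"
        "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")
record = parse_vcf_line(line, samples)
print(record["POS"], record["INFO"]["DB"], record["SAMPLES"]["NA00003"]["GT"])
```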
Pierre Lindenbaum@yokofakun pierre.lindenbaum@univ-nantes.fr httNp:e/x/tpGleinnedreantiboanumS.ebqluoegnscpinogtF.icleomFhotrtmpas:t/s/. github.com/lindenb/courses

Contenu connexe

Tendances

Next generation sequencing
Next  generation  sequencingNext  generation  sequencing
Next generation sequencingNidhi Singh
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function predictionLars Juhl Jensen
 
Comparative genomic hybridization
Comparative genomic hybridizationComparative genomic hybridization
Comparative genomic hybridizationvlmawia
 
Quality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained withQuality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained withHafiz Muhammad Zeeshan Raza
 
Multiple Sequence Alignment Tool Using NCBI COBALT
Multiple Sequence Alignment Tool Using NCBI COBALTMultiple Sequence Alignment Tool Using NCBI COBALT
Multiple Sequence Alignment Tool Using NCBI COBALTMohsin Raza
 
Global and local alignment in Bioinformatics
Global and local alignment in BioinformaticsGlobal and local alignment in Bioinformatics
Global and local alignment in BioinformaticsMahmudul Alam
 
T coffee algorithm dissection
T coffee algorithm dissectionT coffee algorithm dissection
T coffee algorithm dissectionGui Chen
 
Dynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentDynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentGeethanjaliAnilkumar2
 
Global and Local Sequence Alignment
Global and Local Sequence AlignmentGlobal and Local Sequence Alignment
Global and Local Sequence AlignmentAjayPatil210
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingmikaelhuss
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisUniversity of California, Davis
 
Single cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applicationsSingle cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applicationsfaraharooj
 
Needleman wunsch computional ppt
Needleman wunsch computional pptNeedleman wunsch computional ppt
Needleman wunsch computional ppttarun shekhawat
 
Sequence similarity tools.pptx
Sequence similarity tools.pptxSequence similarity tools.pptx
Sequence similarity tools.pptxPagudalaSangeetha
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation SequencingArindam Ghosh
 

Tendances (20)

Next generation sequencing
Next  generation  sequencingNext  generation  sequencing
Next generation sequencing
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function prediction
 
Comparative genomic hybridization
Comparative genomic hybridizationComparative genomic hybridization
Comparative genomic hybridization
 
Quality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained withQuality control of sequencing with fast qc obtained with
Quality control of sequencing with fast qc obtained with
 
Multiple Sequence Alignment Tool Using NCBI COBALT
Multiple Sequence Alignment Tool Using NCBI COBALTMultiple Sequence Alignment Tool Using NCBI COBALT
Multiple Sequence Alignment Tool Using NCBI COBALT
 
Sanger sequencing method of DNA
Sanger sequencing method of DNA Sanger sequencing method of DNA
Sanger sequencing method of DNA
 
Global and local alignment in Bioinformatics
Global and local alignment in BioinformaticsGlobal and local alignment in Bioinformatics
Global and local alignment in Bioinformatics
 
Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
T coffee algorithm dissection
T coffee algorithm dissectionT coffee algorithm dissection
T coffee algorithm dissection
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
Dynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentDynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignment
 
Global and Local Sequence Alignment
Global and Local Sequence AlignmentGlobal and Local Sequence Alignment
Global and Local Sequence Alignment
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression AnalysisSo you want to do a: RNAseq experiment, Differential Gene Expression Analysis
So you want to do a: RNAseq experiment, Differential Gene Expression Analysis
 
Single cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applicationsSingle cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applications
 
Needleman wunsch computional ppt
Needleman wunsch computional pptNeedleman wunsch computional ppt
Needleman wunsch computional ppt
 
Sequence similarity tools.pptx
Sequence similarity tools.pptxSequence similarity tools.pptx
Sequence similarity tools.pptx
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation Sequencing
 

File formats for Next Generation Sequencing

  • 1. Next Generation Sequencing File Formats. Pierre Lindenbaum @yokofakun pierre.lindenbaum@univ-nantes.fr http://plindenbaum.blogspot.com https://github.com/lindenb/courses Institut du Thorax. Nantes. France September 19, 2014
  • 2. You don't need to have a deep knowledge of those formats. (Unless you're doing NGS)
  • 3. Understand how people have solved their BIG data problems.
  • 4. Why sequencing?
  • 7. Well, that's a little more complicated ...
  • 8. FASTQ
  • 9. FASTQ: a text-based format for storing both a DNA sequence and its corresponding quality scores
  • 10. FASTQ for single end
  • 11. FASTQ for paired end
  • 12. FASTQ Example
    @IL31_4368:1:1:996:8507/1
    NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA
    +
    (94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.%
    @IL31_4368:1:1:996:21421/1
    NAAGTTAATTCTTCATTGTCCATTCCTCTGAAATGATTCAGAAATACTGGTAGT
    +
    (**+*2396,@<+<:@@@;;5)<0)69606>4;5>;>6&<102)0*+8:&137;
    @IL31_4368:1:1:997:10572/1
    NAATGTATGTAGACCCTTCACATTCAAAGGCAAATACAATATCATCATGTCTTC
    +
    (/9**-0032>:>>9>4@@=>??@@:-66,;>;<;6+;255,1;7>>>>3676'
    @IL31_4368:1:1:997:15684/1
    NGCAATCAATGCTATGATTGATCCTGATGGAACTTTGGAGGCTCTGAACAACAT
    +
    ()1,*37766>@@@>?@<?@@:>@0>>><-888>8;>*;966>;;;@8@4,.2.
    @IL31_4368:1:1:997:15249/1
    NCGTTATAATGGAATTATTTTTCTTCCTTTATTTAATGTGTTGACAAAGAGAAC
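As a rough illustration of how the four-line records in the example can be read, here is a minimal Python sketch. The names `FastqRecord` and `read_fastq` are invented for this example (they do not come from the slides), and the sketch assumes every record occupies exactly four lines, with no wrapped sequences.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: its name, its bases and one quality character per base."""
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:          # empty string means end of file
            return
        header = header.rstrip("\r\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# First record of the example slide.
example = (
    "@IL31_4368:1:1:996:8507/1\n"
    "NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA\n"
    "+\n"
    "(94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.%\n"
)
records = list(read_fastq(io.StringIO(example)))
print(records[0].name, len(records[0].sequence))
```

Because the function is a generator, it reads one record at a time and never holds the whole file in memory, which matters for files with hundreds of millions of reads.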
  • 13. FASTQ name
    @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
    Col      Brief description
    EAS139   the unique instrument name
    136      the run id
    FC706VJ  the flowcell id
    2        flowcell lane
    2104     tile number within the flowcell lane
    15343    'x'-coordinate of the cluster within the tile
    197393   'y'-coordinate of the cluster within the tile
    1        the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
    Y        Y if the read fails filter (read is bad), N otherwise
    18       0 when none of the control bits are on, otherwise it is an even number
    ATCACG   index sequence
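The name layout described above can be split mechanically. The following Python sketch is only an illustration: `IlluminaName` and `parse_illumina_name` are hypothetical names, and the code assumes the name follows exactly the eleven-field layout of the slide (seven colon-separated fields, a space, then four more).

```python
from typing import NamedTuple


class IlluminaName(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int      # 1 or 2
    fails_filter: bool    # True when the flag is 'Y' (read is bad)
    control_bits: int
    index_sequence: str


def parse_illumina_name(name: str) -> IlluminaName:
    """Split a read name of the form shown on the slide into its fields."""
    location, description = name.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    member, filtered, control, index = description.split(":")
    return IlluminaName(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(member), filtered == "Y", int(control), index,
    )


parsed = parse_illumina_name(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
print(parsed.flowcell_id, parsed.lane, parsed.fails_filter)
```

Older read names, such as the `IL31_4368:1:1:996:8507/1` style in the FASTQ example, use a different layout and would need their own parser.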
  • 15. FASTQ Quality
  • 16. FASTQ Quality A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect). $Q_{\text{sanger}} = -10 \log_{10} p$
  • 17. FASTQ Quality Since a human-readable format is desired for SAM, 33 is added to the calculated quality in order to make it a printable character ranging from ! to ~. $Q_{\text{sanger}} = -10 \log_{10} p + 33$
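To make the quality arithmetic concrete, here is a small Python sketch of the conversion between an error probability, a Phred score and its printable character. The function names are made up for this example, and the sketch assumes the Sanger convention with an offset of 33 described above.

```python
import math

SANGER_OFFSET = 33   # ASCII '!' (33) encodes Q = 0
MAX_QUALITY = 93     # ASCII '~' (126) is the last printable character


def phred_quality(p: float) -> int:
    """Q = -10 * log10(p), rounded and capped to the printable range."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return min(MAX_QUALITY, round(-10 * math.log10(p)))


def encode_quality(q: int) -> str:
    """Phred score -> printable character (offset 33)."""
    return chr(q + SANGER_OFFSET)


def decode_quality(char: str) -> int:
    """Printable character -> Phred score."""
    return ord(char) - SANGER_OFFSET


def error_probability(q: int) -> float:
    """Phred score -> probability that the base call is wrong."""
    return 10 ** (-q / 10)


# An error rate of 1 in 1000 is Q30, written as '?'.
print(phred_quality(0.001), encode_quality(30))
# The first quality character of the FASTQ example, '(', is Q7.
print(decode_quality("("), round(error_probability(7), 3))
```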
  • 18. Aligned Reads
    44187101  44187111  44187121  44187131  44187141  44187151  44187161  44187171
    aaatgagccaggtgtggtggtgcacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc REFERENCE
    ............................................Y................................... CONSENSUS
    aaa gagccaggtgtggtggtgcacaccgataggcccagctacgtaggaggctgaggtgggaggatcgcttaaa  cggc
    AAA GAGCCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAA  CGGC
    aaatga CCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCC  c
    aaatgagcc GGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC
    AAATGAGCCAGG gtggtggtgcacacctatagtcccagcgacgtaggaggctgaggtgggaggatcgcttaaacccggc
    AAATGAGCCAGGTG ggtggtgcacacctatagtcccagctaagtaggaggctgaggtgggaggatcgctttaacccggc
    AAATGAGCCAGGTGT GTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCGGGC
    ACATGAGCCAGGTGTG tggtgcacacctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc
    aaatgagccaggtgtgg    GCACACGTAAAGTCCCAGCTACGCAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC
    CAATGAGCCAGTTGTGG     cacacctatagtcccagctacgcacgaggctgaggtgggaggatcgctttaacccggc
    AAATGAGCCAGGTGAGGT    cacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc
    AAATGAGCCAGGTGTGGT     acacctatagtcccagctacgcaggaggctgaggtgggaggatcgctttaacccggc
    aaatgagccaggtgtggtgg      cctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc
    AAATGAGCCAGGTGTGGTGG        TATAGTCCCAGCTACGCAGGAGGCTGAGGTGGTAGGATCGCATAAACCCGGC
    AAATGAGCCAGGTGTGGTGGT         TAGTCCCAGCTACGTAGGAGGCTGAGTTGGGAGGATCTCTTAAACCCGGC
    aaatgagccaggtgtggtggtg        TCGTCCCAGCTACGCAGGAGGCTTAGGTGGGAGGATCGCTTAAACCCGGC
    aaatgagccaggtgtggtggtgca       AGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGGTTAAACCCGGC
    aaatgagccaggtgtggtggtgcac         cccagctacgcaggaggctgaggtgggaccatcgcttaaaccccgc
    aaatgagccaggtgtggtggtgcac          CCAGCTACGTAGTAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC
  • 19. SAM
  • 20. SAM Format SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments
  • 21. SAM Format
    - Is flexible enough to store all the alignment information generated by various alignment programs;
    - Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
    - Is compact in file size;
    - Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
    - Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
  • 24. SAM Format Structure
    + HEADER
      - version
      - program parameters
    + GENOME
      - chrom1 size
      - chrom2 size
      - chrom3 size
      - (..)
    + GROUPS
      - group1 : sample1, lane 4
      - group2 : sample2, lane 1
    + BODY
      - READ1 - group1
      - READ2 - group1
      - READ3 - group1
      - READ4 - group2
      - (...)
  • 25. SAM Example Simple example
    @HD VN:1.5 SO:coordinate
    @SQ SN:ref LN:45
    r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
    r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
    r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
    r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
    r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
    r001 83 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
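The CIGAR strings in this example (8M2I4M1D3M, 6M14N5M, ...) describe how each read lines up against the reference. As a hedged sketch, the Python below splits a CIGAR string into operations and computes how many reference bases and how many read bases it covers. The helper names are invented for this illustration; the rule it relies on is that M, D, N, = and X advance along the reference, while M, I, S, = and X advance along the read.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split e.g. '8M2I4M1D3M' into [(8, 'M'), (2, 'I'), ...]."""
    if cigar == "*":            # '*' means the CIGAR is unavailable
        return []
    ops = [(int(count), op) for count, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything other than count/operation pairs.
    if "".join(f"{count}{op}" for count, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases the CIGAR accounts for (length of SEQ)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


# r001 from the example: 17 read bases spread over 16 reference bases.
print(parse_cigar("8M2I4M1D3M"))
print(reference_length("8M2I4M1D3M"), query_length("8M2I4M1D3M"))
```

A quick sanity check is that `query_length` should equal the length of the SEQ column, which holds for every read in the example (hard-clipped bases, H, are not stored in SEQ).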
  • 26. SAM Header Section
  • 27. SAM Header
    @HD VN:1.0 SO:coordinate
    @SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
    @SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
    @SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
    @RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743
    @RG ID:UM0098:2 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L002 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743
    @PG ID:bwa VN:0.5.4
    @PG ID:GATK TableRecalibration VN:1.0.3471 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, CycleCovariate,
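Header lines like these are easy to turn into key/value pairs. The sketch below is a minimal illustration, with `parse_header_line` being a name chosen for this example. It assumes the fields are separated by tabs, as in a real SAM file (the slide shows them with spaces), and it splits each field on the first colon only, since values such as `UR:file:/data/...` contain colons themselves.

```python
from typing import Dict, Tuple


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Return the record type (HD, SQ, RG, PG, CO) and its TAG:VALUE pairs."""
    line = line.rstrip("\r\n")
    if not line.startswith("@"):
        raise ValueError("not a SAM header line")
    record_type, *fields = line.split("\t")
    record_type = record_type[1:]           # drop the leading '@'
    if record_type == "CO":                 # comment lines hold free text
        return record_type, {"text": "\t".join(fields)}
    return record_type, dict(field.split(":", 1) for field in fields)


sq_line = (
    "@SQ\tSN:1\tLN:249250621\tAS:NCBI37\t"
    "UR:file:/data/local/ref/GATK/human_g1k_v37.fasta\n"
)
record_type, tags = parse_header_line(sq_line)
print(record_type, tags["SN"], int(tags["LN"]))
```

Collecting the `SN` and `LN` tags of every `@SQ` line gives the chromosome sizes, which is the "GENOME" part of the structure outlined earlier.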
  • 28. SAM Alignment Section
  • 29. SAM Example Simple example
    IL31_4368:1:1:996:8507 77 * 0 0 * * 0 0 NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA (94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.%
    IL31_4368:1:1:996:8507 141 * 0 0 * * 0 0 TCCCTTACCCCCAAGCTCCATACCCTCCTAATGCCCACACCTCTTACCTTAGGA FFCEFFFEEFFFFFFFEFFEFFFEFCFCEEFEFFFCEFF;EEFF=FEE?FCE
    IL31_4368:1:1:996:21421 77 * 0 0 * * 0 0 NAAGTTAATTCTTCATTGTCCATTCCTCTGAAATGATTCAGAAATACTGGTAGT (**+*2396,@<+<:@@@;;5)<0)69606>4;5>;>6&<102)0*+8:&137;
    IL31_4368:1:1:996:21421 141 * 0 0 * * 0 0 CAAAAACTTTCACTTTACCTGCCGGGTTTCCCAGTTTACATTCCACTGTTTGAC DBDDB,B9BAA4AAB7BB?7BBB=91;+*@;587+*=/*@@?9=73=.7)7*
    IL31_4368:1:1:997:10572 77 * 0 0 * * 0 0 NAATGTATGTAGACCCTTCACATTCAAAGGCAAATACAATATCATCATGTCTTC (/9**-0032>:>>9>4@@=>??@@:-66,;>;<;6+;255,1;7>>>>3676'
    IL31_4368:1:1:997:10572 141 * 0 0 * * 0 0 GATCTTCTGTGACTGGAAGAAAATGTGTTACATATTACATTTCTGTCCCCATTG E?=EECEEEEE98EEEEAEEBD??BE@AEABEEABCEEDECEBDA=DEE
    IL31_4368:1:1:997:15684 83 chr1 241356612 60 54M = 241356442 -224 ATGTTGTTCAGAGCCTCCAAAGTTCCATCAGGATCAATCATAGCATTGATTGCN
    IL31_4368:1:1:997:15684 163 chr1 241356442 60 54M = 241356612 224 CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC
    IL31_4368:1:1:997:15249 77 * 0 0 * * 0 0 NCGTTATAATGGAATTATTTTTCTTCCTTTATTTAATGTGTTGACAAAGAGAAC (916928.82@@054;33222224;@2?22;5=;;858*0666
    IL31_4368:1:1:997:15249 141 * 0 0 * * 0 0 AATGTTCTGAAACCTCTGAGAAAGCAAATATTTATTTTAATGAAAAATCCTTAT EDEEC;EEE;EEE?EECE;7AEEEEEE07EECEA;D6D+EE4E7EEE4;E=EA
    IL31_4368:1:1:997:6273 77 * 0 0 * * 0 0 NTACGAAGAAGTATTTCATTGGGAGGAGCTTATCCAAATATTTCCTGTCTATCC (**4*5-*3299+::@2;;853+39;0.3)-)79)..'5.988*200
    IL31_4368:1:1:997:6273 141 * 0 0 * * 0 0 ACATTTACCAAGACCAAAGGAAACTTACCTTGCAAGAATTAGACAGTTCATTTG EEAAFFFEEFEFCFAFFAFCCFFEFEFEFFFFB?ABA@ECEE=F@DE@DDF;
    IL31_4368:1:1:997:1657 83 chr1 143630364 60 54M = 143630066 -352 TACCTTTTTAAAGAGATCTAAAATTGTCACATGGTTATTAGATACAGAGGCCTN
    IL31_4368:1:1:997:1657 163 chr1 143630066 60 54M = 143630364 352 CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT
    IL31_4368:1:1:997:5609 77 * 0 0 * * 0 0 NGGTGTCTCTTACGGACAGCATTAAGCTAGATTCTTTTTAGACCGATCTGCCAA (*+*,1426;@@??@?9@@@@4?666260.)-*9;;;8:'0418
    IL31_4368:1:1:997:5609 141 * 0 0 * * 0 0 TCACTATCAGAAACAGAATGTATAACTTCCAAATCAGTAGGAAACACAAGGAAA AEECECBEC@A;AC=AEEEEAEEEEAC,CE?ECCE9EAEC4E:CAC@EE)
    IL31_4368:1:1:997:14262 77 * 0 0 * * 0 0 NGAGAACCAATGGGAAGCAGCCTGAGCTGCTGGAACCTATTCCCCATGACTTCA (9136242-2@@@;96.@@@@0$2623.':**+3*03137..--.
    IL31_4368:1:1:997:14262 141 * 0 0 * * 0 0 TGTTTTTTCTTTTTCTTTTTTTTTTGACAGTGCAGAGATTTTTTATCTTTTTAA 97'2.64.?7/3(891?=(6??6+6++/*..3(:'/'9::''(1.(,
    IL31_4368:1:1:998:19914 77 * 0 0 * * 0 0 NAGAGCATTGACACACATAAAAAATTAAAACAACCCTTTGTACTTACGGTAGAA (/89255@?7..())@@@;2265267@@8..3;/$
    IL31_4368:1:1:998:19914 141 * 0 0 * * 0 0 GAATGAAAGCAGAGACCCTGATCGAGCCCCAGAAAGATACACCTCCAGATTTTA C?=CECE4CD?8@==;EBE=0@:@@92@???6991.?A=@5?@99;971
  • 30. SAM Example: sorted SAM. One row is one read, NOT one fragment.
      IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21 58 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
      IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17 -58 ACCCTACCCCTAGCCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA
      IL31_4368:1:10:17817:9758 137 chr1 23 0 54M = 23 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCAGATC
      IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44 61 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
      IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37 -61 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
  • 32. Specifications: Record Column
      Col  Field  Type    Brief description
      1    QNAME  String  Query template NAME
      2    FLAG   Int     bitwise FLAG
      3    RNAME  String  Reference sequence NAME
      4    POS    Int     1-based leftmost mapping POSition
      5    MAPQ   Int     MAPping Quality
      6    CIGAR  String  CIGAR string
      7    RNEXT  String  Ref. name of the mate/next read
      8    PNEXT  Int     Position of the mate/next read
      9    TLEN   Int     observed Template LENgth
      10   SEQ    String  segment SEQuence
      11   QUAL   String  ASCII of Phred-scaled base QUALity+33
      12   META           metadata
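The QUAL column stores one character per base: the Phred-scaled quality plus 33, written as an ASCII character. A minimal Java sketch of the decoding is given here; the class and method names (`PhredQual`, `decode`, `errorProbability`) are illustrative assumptions and do not come from any library.

```java
/** Illustrative helper (hypothetical name): decodes a Phred+33 QUAL string. */
final class PhredQual {

    /** Converts each QUAL character to its Phred score (ASCII code minus 33). */
    static int[] decode(String qual) {
        int[] scores = new int[qual.length()];
        for (int i = 0; i < qual.length(); i++) {
            scores[i] = qual.charAt(i) - 33;
        }
        return scores;
    }

    /** Phred definition: Q = -10 * log10(p), hence p = 10^(-Q/10). */
    static double errorProbability(int phred) {
        return Math.pow(10.0, -phred / 10.0);
    }

    public static void main(String[] args) {
        // A few characters taken from a QUAL string; any QUAL value works here.
        for (int q : decode("GGFE@?B")) {
            System.out.printf("Q=%d p=%.6f%n", q, errorProbability(q));
        }
    }
}
```

With this encoding the character `G` (ASCII 71) stands for a Phred score of 38, and `!` (ASCII 33) for a score of 0.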
  • 34. Specifications: Record Column (example record)
      Col  Field  Value
      1    QNAME  IL31_4368:1:42:12530:7509
      2    FLAG   137
      3    RNAME  chr1
      4    POS    10
      5    MAPQ   30
      6    CIGAR  54M
      7    RNEXT  =
      8    PNEXT  100
      9    TLEN   90
      10   SEQ    TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGATC
      11   QUAL   GGGGGGGFEGGGGCFGGGGGEGGFGEGGFGFGGFGFEGFCFFBECCBDACB@?B
      12   META   XT:A:R NM:i:3 SM:i:0 AM:i:0 X0:i:11 X1:i:0 XM:i:3 XO:i:0 XG:i:0
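The record above can be split into its columns with a few lines of code. This is only a sketch: it assumes the fields are TAB-separated, as in a SAM file on disk, and the class name `SamLine` is made up for illustration. A real program would rather use an existing library.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative holder (hypothetical name) for one SAM alignment line. */
final class SamLine {
    final String qname, rname, cigar, rnext, seq, qual;
    final int flag, pos, mapq, pnext, tlen;
    final List<String> tags; // optional TAG:TYPE:VALUE fields, possibly empty

    SamLine(String line) {
        String[] f = line.split("\t");
        if (f.length < 11) {
            throw new IllegalArgumentException("expected at least 11 columns, got " + f.length);
        }
        qname = f[0];
        flag  = Integer.parseInt(f[1]);
        rname = f[2];
        pos   = Integer.parseInt(f[3]); // 1-based leftmost position
        mapq  = Integer.parseInt(f[4]);
        cigar = f[5];
        rnext = f[6];
        pnext = Integer.parseInt(f[7]);
        tlen  = Integer.parseInt(f[8]);
        seq   = f[9];
        qual  = f[10];
        tags  = Arrays.asList(f).subList(11, f.length);
    }

    public static void main(String[] args) {
        String line = String.join("\t",
            "IL31_4368:1:42:12530:7509", "137", "chr1", "10", "30", "54M", "=", "100", "90",
            "TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGATC",
            "GGGGGGGFEGGGGCFGGGGGEGGFGEGGFGFGGFGFEGFCFFBECCBDACB@?B",
            "XT:A:R", "NM:i:3");
        SamLine rec = new SamLine(line);
        System.out.println(rec.qname + " maps to " + rec.rname + ":" + rec.pos + " with tags " + rec.tags);
    }
}
```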
  • 35. SAM FLAGS
      read paired.
      read mapped in proper pair.
      read unmapped.
      mate unmapped.
      read reverse strand.
      mate reverse strand.
      first in pair.
      second in pair.
      not primary alignment.
      read fails platform/vendor quality checks.
      read is PCR or optical duplicate.
      supplementary alignment.
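Each property in this list is one bit of the FLAG integer, starting at 0x1 for "read paired" and doubling at each step up to 0x800 for "supplementary alignment" (values from the SAM specification). The sketch below decodes a FLAG into the names used in the list; the class name `SamFlags` is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative decoder (hypothetical name) for the bitwise SAM FLAG. */
final class SamFlags {
    // One name per bit, lowest bit first, in the order of the list above.
    static final String[] NAMES = {
        "read paired",                               // 0x1
        "read mapped in proper pair",                // 0x2
        "read unmapped",                             // 0x4
        "mate unmapped",                             // 0x8
        "read reverse strand",                       // 0x10
        "mate reverse strand",                       // 0x20
        "first in pair",                             // 0x40
        "second in pair",                            // 0x80
        "not primary alignment",                     // 0x100
        "read fails platform/vendor quality checks", // 0x200
        "read is PCR or optical duplicate",          // 0x400
        "supplementary alignment"                    // 0x800
    };

    /** Returns the names of all bits that are set in the flag. */
    static List<String> decode(int flag) {
        List<String> set = new ArrayList<>();
        for (int bit = 0; bit < NAMES.length; bit++) {
            if ((flag & (1 << bit)) != 0) {
                set.add(NAMES[bit]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        // FLAG values that appear in the SAM examples of this deck.
        for (int flag : new int[] {77, 141, 83, 163, 137}) {
            System.out.println(flag + " = " + decode(flag));
        }
    }
}
```

For instance 83 = 64 + 16 + 2 + 1, so the read is paired, mapped in a proper pair, on the reverse strand, and first in its pair.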
  • 37. SAM FLAGS
  • 38. SAM FLAGS Read Paired
  • 39. SAM FLAGS Read mapped in proper pair
  • 40. Read mapped in proper pair
  • 41. SAM FLAGS Read unmapped
  • 42. SAM FLAGS Mate unmapped
  • 43. SAM FLAGS Read reverse strand
  • 44. SAM FLAGS Mate reverse strand
  • 45. SAM FLAGS First in pair
  • 46. SAM FLAGS Second in pair
  • 47. SAM FLAGS not primary alignment
  • 48. SAM FLAGS read fails platform/vendor quality checks
  • 49. SAM FLAGS read is PCR or optical duplicate
  • 50. SAM FLAGS supplementary alignment
  • 51. SAM CIGAR: The CIGAR string is a sequence of base lengths, each followed by its associated operation. It indicates which bases align with the reference (as a match or a mismatch), which bases are deleted from the reference, and which bases are insertions that are not in the reference.
  • 52. SAM Cigar
      Op  BAM  Description
      M   0    alignment match (can be a sequence match or mismatch)
      I   1    insertion to the reference
      D   2    deletion from the reference
      N   3    skipped region from the reference
      S   4    soft clipping (clipped sequences present in SEQ)
      H   5    hard clipping (clipped sequences NOT present in SEQ)
      P   6    padding (silent deletion from padded reference)
      =   7    sequence match
      X   8    sequence mismatch
  • 53. SAM Cigar (http://genome.sph.umich.edu/wiki/SAM)
      RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
      Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
      Read: ACTAGAATGGCT
      Aligning these two:
      RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
      Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
      Read:                   A  C  T  A  G  A  A     T  G  G  C  T
      With the alignment above, you get:
      POS: 5
      CIGAR: 3M1I3M1D5M
      or
      CIGAR: 3=1I3=1D2=1X2=
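The example can be checked with a little arithmetic: the operations M, I, S, = and X consume bases of the read, while M, D, N, = and X consume bases of the reference. The sketch below sums the lengths accordingly. The class name `Cigar` and its methods are illustrative, and the code does not validate the whole string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative CIGAR arithmetic (hypothetical class name). */
final class Cigar {
    // One CIGAR element: a length followed by an operation character.
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    /** Number of read bases described by the CIGAR; should equal the length of SEQ. */
    static int readLength(String cigar) {
        return sum(cigar, "MIS=X");
    }

    /** Number of reference bases covered by the alignment. */
    static int referenceLength(String cigar) {
        return sum(cigar, "MDN=X");
    }

    /** Adds the lengths of all elements whose operation is in the given set. */
    private static int sum(String cigar, String operations) {
        int total = 0;
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            if (operations.indexOf(m.group(2).charAt(0)) >= 0) {
                total += Integer.parseInt(m.group(1));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String cigar = "3M1I3M1D5M";
        int pos = 5; // POS of the example alignment
        System.out.println("read bases: " + readLength(cigar));
        System.out.println("last reference position: " + (pos + referenceLength(cigar) - 1));
    }
}
```

For `3M1I3M1D5M` both sums are 12: the read ACTAGAATGGCT has 12 bases, and the alignment covers reference positions 5 to 16.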
  • 54. SAM Cigar Soft Clip
  • 55. SAM Cigar Hard Clip
  • 56. SAM Format optional TAGs: optional fields on a SAM/BAM alignment. A TAG comprises a two-character TAG key, the type of the value, and the value: [A-Za-z][A-Za-z0-9]:[AifZH]:.* The types A, i, f, Z, H indicate the type of value stored in the tag.
      Type  Description
      A     character
      i     signed 32-bit integer
      f     single-precision float
      Z     string
      H     hex string
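A tag can be split with the regular expression given on the slide. The following sketch is one possible way to do it; the class name `SamTag` is hypothetical, and the pattern used here allows a digit as the second character of the key, as required by tags such as X0 and X1 in the example record.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser (hypothetical name) for one optional TAG:TYPE:VALUE field. */
final class SamTag {
    private static final Pattern TAG =
        Pattern.compile("([A-Za-z][A-Za-z0-9]):([AifZH]):(.*)");

    final String key;
    final char type;
    final Object value;

    SamTag(String text) {
        Matcher m = TAG.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a TAG:TYPE:VALUE field: " + text);
        }
        key = m.group(1);
        type = m.group(2).charAt(0);
        String raw = m.group(3);
        switch (type) {
            case 'A': value = raw.charAt(0); break;     // single character
            case 'i': value = Integer.valueOf(raw); break; // signed 32-bit integer
            case 'f': value = Float.valueOf(raw); break;   // single-precision float
            default:  value = raw;                      // Z and H are kept as text in this sketch
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"XT:A:R", "NM:i:3", "X0:i:11"}) {
            SamTag t = new SamTag(s);
            System.out.println(t.key + " (" + t.type + ") = " + t.value);
        }
    }
}
```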
  • 58. SAM Format optional TAGs
      XT:A:U - user-defined tag called XT. It holds a character. The value associated with this tag is 'U'.
      NM:i:2 - predefined tag NM means: edit distance to the reference (number of changes necessary to make this equal the reference, excluding clipping).
  • 61. SAM Example: sorted SAM
      IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21 58 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
      IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17 -58 ACCCTACCCCTAGCCCTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA
      IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44 61 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
      IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37 -61 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
  • 62. BAM
  • 63. BGZF Format: The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text format (SAM), and a compressed binary format (BAM). The latter uses a modified form of gzip compression called BGZF (Blocked GNU Zip Format), which can be applied to any file format to provide compression with efficient random access.
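The idea behind the blocked layout can be illustrated with the standard `java.util.zip` classes: if every block is compressed on its own, a reader that knows the offset of a block can decompress it without touching the preceding data. This is only a sketch of the principle, not an implementation of BGZF. Real BGZF also stores each block's compressed size in a gzip extra field and limits blocks to 64 KiB, both of which are left out here. The class name `BlockedGzipSketch` is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative sketch (hypothetical name): independently compressed gzip blocks. */
final class BlockedGzipSketch {

    /** Compresses one block as a self-contained gzip member. */
    static byte[] gzipBlock(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    /** Decompresses only the block stored at [offset, offset + length) of the file. */
    static String readBlockAt(byte[] file, int offset, int length) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(file, offset, length))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] first = gzipBlock("records of block 1\n");
        byte[] second = gzipBlock("records of block 2\n");

        // The "file" is simply the concatenation of the compressed blocks.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(first);
        file.write(second);

        // An index would store the offset of the second block: first.length.
        System.out.print(readBlockAt(file.toByteArray(), first.length, second.length));
    }
}
```

An index such as the BAM index only has to remember where each block starts and which genomic range it covers.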
  • 67. BAM INDEX
  • 68. CRAM
  • 69. VCF Variant Call Format
  • 70. VCF Format: VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.
  • 72. VCF Example
      ##fileformat=VCFv4.0
      ##fileDate=20090805
      ##source=myImputationProgramV3.1
      ##reference=1000GenomesPilot-NCBI36
      ##phasing=partial
      ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
      ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
      ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
      ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
      ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
      ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
      ##FILTER=<ID=q10,Description="Quality below 10">
      ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
      ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
      ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
      20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
      20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
      20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
  • 73. VCF Column: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE-1 SAMPLE-2 SAMPLE-3 ... (...)
  • 75. INFO fields should be described as follows:
      ##INFO=<ID=ID,Number=number,Type=type,Description="description">
      (...)
      ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
      (...)
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
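The INFO column itself is a semicolon-separated list of `key=value` pairs, where a key of type Flag (such as DB or H2) appears without a value. A minimal sketch of a parser follows; the class name `VcfInfo` is an assumption, and all values are kept as text rather than converted according to the `Type` declared in the header.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parser (hypothetical name) for the INFO column of a VCF data line. */
final class VcfInfo {

    /** Parses e.g. "NS=3;DP=14;AF=0.5;DB;H2"; flags are stored with an empty value. */
    static Map<String, String> parse(String info) {
        Map<String, String> map = new LinkedHashMap<>();
        if (info.equals(".")) {
            return map; // "." means that no INFO is available
        }
        for (String entry : info.split(";")) {
            int eq = entry.indexOf('=');
            if (eq < 0) {
                map.put(entry, "");
            } else {
                map.put(entry.substring(0, eq), entry.substring(eq + 1));
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> info = parse("NS=3;DP=14;AF=0.5;DB;H2");
        System.out.println("samples with data: " + info.get("NS"));
        System.out.println("in dbSNP: " + info.containsKey("DB"));
    }
}
```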
  • 76. VCF FILTERs: FILTERs that have been applied to the data should be described as follows:
      ##FILTER=<ID=ID,Description="description">
      (...)
      ##FILTER=<ID=q10,Description="Quality below 10">
      ##FILTER=<ID=s50,Description="Less than 50 percent of samples have data">
      (...)
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
      20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
  • 79. Genotype fields specified in the FORMAT field should be described as follows:
      ##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
      (...)
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      (...)
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
      20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
      20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
      20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
      20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
      20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
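Each sample column is read with the FORMAT column as its key: both are colon-separated, and the n-th value of the sample belongs to the n-th FORMAT key. In the GT value, 0 is the REF allele and 1, 2, ... index the ALT alleles, with `|` separating phased and `/` unphased alleles. The sketch below shows this pairing; the class name `VcfGenotype` and its methods are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper (hypothetical name) for per-sample VCF columns. */
final class VcfGenotype {

    /** Pairs the FORMAT keys with the values of one sample; trailing values may be absent. */
    static Map<String, String> parseSample(String format, String sample) {
        String[] keys = format.split(":");
        String[] values = sample.split(":");
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < keys.length && i < values.length; i++) {
            map.put(keys[i], values[i]);
        }
        return map;
    }

    /** Allele indexes of a GT value: 0 is REF, 1.. are ALT, -1 stands for a missing call. */
    static int[] alleles(String gt) {
        String[] parts = gt.split("[|/]");
        int[] indexes = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            indexes[i] = parts[i].equals(".") ? -1 : Integer.parseInt(parts[i]);
        }
        return indexes;
    }

    /** True when the alleles are separated by '|' (phased). */
    static boolean isPhased(String gt) {
        return gt.indexOf('|') >= 0;
    }

    public static void main(String[] args) {
        Map<String, String> sample = parseSample("GT:GQ:DP:HQ", "0|1:3:5:65,3");
        String gt = sample.get("GT");
        System.out.println("GT=" + gt + " phased=" + isPhased(gt) + " depth=" + sample.get("DP"));
    }
}
```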
  • 81. Tabix
  • 82. Binning
  • 83. Tabix INDEX
  • 84. Building the TABIX index
      $ bgzip -f file.vcf
      $ tabix -p vcf file.vcf.gz
  • 85. Querying the TABIX index
      $ tabix file.vcf.gz chr3:1235456778
  • 86. API
  • 87. Reading SAM with the samtools C library
      #include <stdlib.h>
      #include <stdio.h>
      #include "bam.h"
      #include "sam.h"

      /* Counts the mapped reads of the file given as first argument. */
      int main(int argc, char *argv[])
      {
          samfile_t *sam;
          bam1_t *b;
          long n = 0L;
          /* "rb" opens the input as binary (BAM) */
          if (argc < 2 || (sam = samopen(argv[1], "rb", 0)) == NULL)
              return EXIT_FAILURE;
          b = bam_init1();
          /* loop while a record could be read */
          while (samread(sam, b) > 0)
          {
              /* count the read unless its "read unmapped" flag is set */
              if (!(b->core.flag & BAM_FUNMAP)) ++n;
          }
          bam_destroy1(b);
          samclose(sam);
          printf("%ld\n", n);
          return 0;
      }
  • 88. Reading SAM with the java picard library
      import java.io.File;
      import net.sf.samtools.*;

      /* Counts the mapped reads of the file given as first argument. */
      public class CountMapped
      {
          public static void main(String[] args)
          {
              long n = 0L;
              File f = new File(args[0]);
              SAMFileReader sam = new SAMFileReader(f);
              /* a SAMFileReader can be iterated record by record */
              for (SAMRecord rec : sam)
              {
                  if (!rec.getReadUnmapped())
                  {
                      ++n;
                  }
              }
              sam.close();
              System.out.println(n);
          }
      }
  • 89. End
  • 90. Credits
      Angus: http://ged.msu.edu/angus/
      Wikipedia: https://en.wikibooks.org/wiki/C%2B%2B_Programming/Programming_Languages/C%2B%2B/Code/Statements/Variables
      Abecasis Group Wiki: http://genome.sph.umich.edu/wiki/SAM
      Genome Research: http://genome.cshlp.org/content/12/6/996