SlideShare une entreprise Scribd logo
Cleaning Illumina reads
Torsten Seemann
VLSCI :: LSCC - Lab Meeting - Fri 23 Nov 2012
Motivation
Illumina reads
● Usually 100 or 150 bp
○ 250bp rolling out now, 400bp next year
● Indel errors rare
● Homopolymer errors rare
● Substitution errors < 1%
○ Error rate higher at 3' end
● Adaptor issues
○ rare in HiSeq (TruSeq prep)
○ more common in MiSeq (Nextera prep)
● Very high quality overall
Illumina libraries
●"single end"
○just a shotgun read sequenced from one end
●"paired end"
○~500bp fragments sequenced at both ends
○very reliable
●"mate pair"
○circularized 2-10 kbp fragments sequencing
○then "paired end" protocol
○reliability varies
● garbage reads
○ instrument weirdness
● duplicate reads
○ low complexity library, PCR duplicate
● adaptor read-through
○ fragment too short
● indel errors
○ skipping bases, inserting extra bases
● uncalled base
○ couldn't reliably estimate, replace with "N"
● substitution errors
○ reading wrong base
Sequences have errors
More common
Less common
Why clean reads?
● Erroneous data may cause software to:
○ run more slowly
○ use more RAM
○ produce poor / biased / incorrect results
● Cleaning can:
○ improve overall average quality of the reads
■ hopefully giving a reliable result
○ reduce the volume of reads
■ some algorithms are O(N.logN) or O(N2
)
■ enable processing when otherwise couldn't
● (some software does handle them appropriately)
The Q in FASTQ
DNA sequence quality
● DNA sequences often have a
quality value associated with each
nucleotide
● A measure of reliability for each base
○ as it is derived from physical process
■ chromatogram (Sanger sequencing)
■ pH reading (Ion Torrent sequencing)
● Formalised by the Phred software for the
Human Genome Project
Phred qualities
● Q = -10 log10
P <=> P = 10- Q / 10
○ Q = Phred quality score
○ P = probability of base call being incorrect
Quality value Chance it is wrong Accuracy
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10,000 99.99%
50 1 in 100,000 99.999%
Illumina quality plot
Y-axis is "Phred" quality values (higher is better)
FASTQ format
Anatomy of a FASTQ entry
@read00179
AGTCTGATATGCTGTACCTATTATAATTCTAGGCGCTCAT
GCCCGCGGATATCGTAGCTATATGCTTCA
+
8;ACCCD?DD???@B9<9<CAC@=AAAA8A;B<A@882,+
495;;3990,02..-&-&-*,,,,(0**#
Start symbol
Sequence ID
Sequence
Separator line
Encoded quality values,
one symbol per nucleotide
FASTQ quality encoding
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
| | | | |
Q0 Q10 Q20 Q30 Q40
bad maybe ok good excellent
Uses letters/symbols to represent numbers:
Mnemonic:
"swear" words!
Mnemonic:
Q20 = '2' and '0'
Mnemonic:
'HI' = high Q
Spoiled for choice
Solexa
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefgh
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
| | | | |
33 59 64 73 104
Illumina 1.3
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefgh
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
| | | | |
33 59 64 73 104
Illumina 1.5
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefgh
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
| | | | |
33 59 64 73 104
Illumina 1.8 (Sanger)
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefgh
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS...............................
| | | | |
33 59 64 73 104
What can we clean?
Ambiguous bases
● If there is ambiguity in the base call, an "N" is used
@ILLUMINA:6:1:964:115#GATCAG/1
GGACCTGAGAGTGTGCATGAAGAGGGCAGCGCGCACNGCATCA
+
HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"!
● Possible software responses:
○ Crash!
○ Ignore it
○ Silently convert to fixed or random base (Velvet > 'A')
○ Handle it appropriately (Bowtie2)
● Small proportion overall, safer to discard
Homopolymers
● A read consisting of all the same base
@ILLUMINA:6:1:964:115#GATCAG/1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"!
● Often occur from clusters at edge of flowcell lane
● Early Illumina software called '?' as 'A' rather than N
● Unlikely to be present in real DNA
● Best to discard
● Less common with newer Illumina OLB software
Quality trimming
●Remove low quality sequence
○Q=13 corresponds to 5% error (p=0.05)
○Q=0..13 encoded by: !"#@%&'()*+,-/
@ILLUMINA:6:1:9646:1115#GATCAG/1
GGACCTGAGAGTGTGCATGAAGAGGGCAGCCCCGCACTGCATG
+
HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"!
●Can trim per
○each base
○window moving average eg. 3 base mean
○minimum % good per window eg. need 4 of 5
Illumina Adaptors
● Used in the sequencing chemistry
● Can appear at ends of read sequences
● Worse for mate-pair than for paired-end reads
● PCR Primer
CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
● Genomic DNA Sequencing Primer
CACTCTTTCCCTACACGACGCTCTTCCGATCT
● All other TruSeq & Nextera Adaptors
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Adaptor clipping
● Method
○ Align 3' and 5' read end against all adaptor sequences
○ If there is an anchored "match", trim the read
● Minimum length of match?
○ want to remove adaptor, but not real sequence [10 bp]
● Allow substitutions in match?
○ as reads have errors, need some tolerance [1 sub]
● Allow gaps/indels in match?
○ indels are unlikely in Illumina reads [no]
● Slow to perform compared to other preprocessing steps
Decloning
●Illumina "mate pair" sequencing
○Requires a lot of starting DNA
○Challenging protocol to implement reliably
○Not enough final DNA leads to PCR duplicates
○Coverage is highly non-uniform and sporadic
○Causes bias in analyses
●Decloning
○Replace clones with a single representative
○Choose representative with highest quality
○Helps salvage usable information content
Read length
●Enforce a minimum read length L
●k-mer based tools
○break reads into k-mers, so L < k is pointless
○Velvet, Trinity, Gossamer, ...
●Uniqueness
○desire reasonable uniqueness of sequence,
otherwise multiple mapping will take forever!
○L=24 is bare minimum (I reckon)
○BWA, Bowtie, Shrimp, ...
Other strategies
●Digital normalisation
○remove low frequency k-mers
○remove reads w/ too many low freq k-mers
●Error correction
○replace low frequency k-mers with their
"closest" high-frequency k-mer
○other methods I don't understand yet
●Just use local alignment to "solve" the problem
Examples
Walk-through
1. Original read + quality =43bp
GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
hhhhghfeefaa^a^[[[^X[[XX^^^^`SSTQPZZBBBBBBB
2. Homopolymer? No
GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
3. Ambiguious N bases? Yes, 1
GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
4. Quality < 20 ? Yes, at 3' end
GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
5. Adaptor sequences > 8bp ? Yes, 9 bp at 5'
end
GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
6. Combine all masks Logical intersection
GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA
7. Extract longest sub-sequence =19bp
TGACCATGATTCAAGGAAC
The coral genome
● Raw data (A.millepora Illumina)
○ 9 libraries - 3 x PE, 6 x MP - 200bp to 10kbp
○ 92.0 Gbp, 943M reads, average length 98bp
● Method
○ Decloned all MP libs, disallow Ns, reject homopolymers,
trim Q < 20 + clip adaptors, minimum length 55bp
● Cleaned data
○ 42.5 Gbp, 478M reads, average length 88bp
● Effect
○ Good - de novo Velvet assembly improved overall
■ validated by RNA-Seq, ESTs
○ Bad - lower coverage
■ coral is very heterozygous > 10%
Per library yields (Gbp)
●Library Raw Cleaned %Kept
●pe_193 9.55 6.71 70
●pe_463 19.19 13.89 72
●pe_580 4.87 3.18 65
●mp_2200 18.48 8.48 46
●mp_2820 13.54 1.54 11 *
●mp_4628 12.95 0.85 6 *
●mp_5000 6.92 2.98 43
●mp_8000 4.33 1.64 38
●mp_10000 2.15 0.25 11 *
●single n/a 3.00 n/a
●TOTAL 92.00 42.50
The GKP FFPE BWA PE mystery
●FFPE sample
○formalin-fixed, paraffin embedded
○long term tissue archival storage method
●Sequencing
○HiSeq-2000
○22 million x 100bp paired-end reads
○quality looks good ... but not mapping!
FastQC for Read 1
FastQC is giving us some hints...
nesoni clip:
Nesoni
●Implemented primarily by Paul Harrison @ VBC
○swiss army knife
○snp calling, phylogenetics, DGE, ....
○extendible pipeline system (Python)
●We used the "nesoni clip:" module
○adaptor clipping = on
○min quality = Q10
○allow Ns = no
○min length = 2
Nesoni command line
nesoni clip:
--quality 10
--length 24
--out-separate yes
SM_AdMP1_ID_07B26948_L004_CLIPPED
pairs:
SM_AdMP1_ID_07B26948_L004_R1.fastq.gz
SM_AdMP1_ID_07B26948_L004_R2.fastq.gz
SM_AdMP1_ID_07B26948_L004_CLIPPED_R1.fq.gz
SM_AdMP1_ID_07B26948_L004_CLIPPED_R2.fq.gz
SM_AdMP1_ID_07B26948_L004_CLIPPED_single.fq.gz
Orphans
Nesoni clip: R1 @ 5' end
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 161,553 read-1 adaptors clipped at start
10 bases: 11757 avg errors: 0.95 3948xPE_Read_2_Sequencing_Primer 1260xMultiplexing_Adapters_1 ...
11 bases: 3212 avg errors: 0.74 758xMultiplexing_PCR_Primer_2.0 269xPE_Read_2_Sequencing_Primer ...
12 bases: 1739 avg errors: 0.62 545xTruSeq_Adapter_Index_1 453xMultiplexing_PCR_Primer_2.0 ...
13 bases: 3054 avg errors: 0.38 2377xTruSeq_Universal_Adapter 429xMultiplexing_PCR_Primer_2.0 ...
14 bases: 760 avg errors: 0.34 357xMultiplexing_PCR_Primer_2.0 163xMultiplexing_Adapters_1 ...
15 bases: 2062 avg errors: 0.13 1558xMultiplexing_PCR_Primer_2.0 261xMultiplexing_Adapters_1 ...
16 bases: 967 avg errors: 0.15 575xMultiplexing_PCR_Primer_2.0 267xMultiplexing_Adapters_1 ...
17 bases: 1271 avg errors: 0.15 603xMultiplexing_PCR_Primer_2.0 419xMultiplexing_Adapters_1 ...
18 bases: 2026 avg errors: 0.23 1095xMultiplexing_PCR_Primer_2.0 483xMultiplexing_Adapters_1 ...
19 bases: 2494 avg errors: 0.07 1956xMultiplexing_PCR_Primer_2.0 422xMultiplexing_Adapters_1 ...
20 bases: 2791 avg errors: 0.07 1481xMultiplexing_Adapters_1 1286xMultiplexing_PCR_Primer_2.0 ...
21 bases: 3946 avg errors: 0.05 3814xMultiplexing_PCR_Primer_2.0 123x3p_RNA_Adapter ...
22 bases: 2238 avg errors: 0.16 1742xMultiplexing_PCR_Primer_2.0 318x3p_RNA_Adapter ...
23 bases: 4207 avg errors: 0.02 4162xMultiplexing_PCR_Primer_2.0 43xTruSeq_Adapter_Index_1 ...
24 bases: 2127 avg errors: 0.03 2086xMultiplexing_PCR_Primer_2.0 33xTruSeq_Adapter_Index_1 ...
25 bases: 1630 avg errors: 0.04 1612xMultiplexing_PCR_Primer_2.0 18xTruSeq_Adapter_Index_4
26 bases: 4031 avg errors: 0.02 4000xMultiplexing_PCR_Primer_2.0 28xTruSeq_Adapter_Index_4 ...
27 bases: 1736 avg errors: 0.04 1626xMultiplexing_PCR_Primer_2.0 110xTruSeq_Adapter_Index_4
28 bases: 3140 avg errors: 0.04 3059xMultiplexing_PCR_Primer_2.0 80xTruSeq_Adapter_Index_4 ...
29 bases: 3277 avg errors: 0.02 3228xMultiplexing_PCR_Primer_2.0 49xTruSeq_Adapter_Index_4
30 bases: 4578 avg errors: 0.03 4542xMultiplexing_PCR_Primer_2.0 36xTruSeq_Adapter_Index_4
31 bases: 4459 avg errors: 0.04 4309xMultiplexing_PCR_Primer_2.0 148xTruSeq_Adapter_Index_4 ...
32 bases: 2753 avg errors: 0.07 2577xMultiplexing_PCR_Primer_2.0 174xTruSeq_Adapter_Index_4 ...
33 bases: 9794 avg errors: 0.05 9255xMultiplexing_PCR_Primer_2.0 536xTruSeq_Adapter_Index_4 ...
34 bases: 63909 avg errors: 0.03 63779xMultiplexing_PCR_Primer_2.0 127xTruSeq_Adapter_Index_4 ...
<snip>
Nesoni clip: R1 @ 3' end
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 9,934,544 read-1 adaptors clipped at end
10 bases: 243222 avg errors: 0.04 224121xTruSeq_Universal_Adapter 11677xPCR_Primer_Index_1 ...
11 bases: 209946 avg errors: 0.02 198663xTruSeq_Universal_Adapter 7666xPCR_Primer_Index_1 ...
12 bases: 287146 avg errors: 0.02 277756xTruSeq_Universal_Adapter 6368xPCR_Primer_Index_1 ...
13 bases: 233067 avg errors: 0.02 222693xTruSeq_Universal_Adapter 7252xPCR_Primer_Index_1 ...
14 bases: 305000 avg errors: 0.02 295013xMultiplexing_PCR_Primer_2.0 7554xPCR_Primer_Index_3 ...
15 bases: 338908 avg errors: 0.02 328225xMultiplexing_PCR_Primer_2.0 8308xPCR_Primer_Index_4 ...
16 bases: 183745 avg errors: 0.02 174275xMultiplexing_PCR_Primer_2.0 7233xPCR_Primer_Index_4 ...
17 bases: 169356 avg errors: 0.02 160221xMultiplexing_PCR_Primer_2.0 6860xPCR_Primer_Index_4 ...
18 bases: 209386 avg errors: 0.02 197727xMultiplexing_PCR_Primer_2.0 7498xPCR_Primer_Index_4 ...
19 bases: 171360 avg errors: 0.02 157072xMultiplexing_PCR_Primer_2.0 8753xPCR_Primer_Index_4 ...
20 bases: 190699 avg errors: 0.02 169311xMultiplexing_PCR_Primer_2.0 18186xPCR_Primer_Index_4 ...
21 bases: 195454 avg errors: 0.02 169850xMultiplexing_PCR_Primer_2.0 22795xPCR_Primer_Index_4 ...
22 bases: 252124 avg errors: 0.21 194698xMultiplexing_PCR_Primer_2.0 47983x3p_RNA_Adapter ...
23 bases: 194228 avg errors: 0.02 183170xMultiplexing_PCR_Primer_2.0 5698xv1.5_Small_RNA_3p_Adapter ...
24 bases: 189333 avg errors: 0.02 178797xMultiplexing_PCR_Primer_2.0 7050xPCR_Primer_Index_4 ...
25 bases: 199061 avg errors: 0.02 190980xMultiplexing_PCR_Primer_2.0 7932xPCR_Primer_Index_4 ...
26 bases: 241032 avg errors: 0.02 235358xMultiplexing_PCR_Primer_2.0 5504xPCR_Primer_Index_4 ...
27 bases: 212056 avg errors: 0.02 206622xMultiplexing_PCR_Primer_2.0 5226xPCR_Primer_Index_4 ...
28 bases: 210022 avg errors: 0.03 203762xMultiplexing_PCR_Primer_2.0 6081xPCR_Primer_Index_4 ...
29 bases: 188424 avg errors: 0.02 183273xMultiplexing_PCR_Primer_2.0 4954xPCR_Primer_Index_4 ...
30 bases: 344312 avg errors: 0.02 339452xMultiplexing_PCR_Primer_2.0 4700xPCR_Primer_Index_4 ...
31 bases: 193023 avg errors: 0.02 187439xMultiplexing_PCR_Primer_2.0 5426xPCR_Primer_Index_4 ...
32 bases: 187459 avg errors: 0.03 180062xMultiplexing_PCR_Primer_2.0 7275xPCR_Primer_Index_4 ...
33 bases: 184063 avg errors: 0.02 180029xMultiplexing_PCR_Primer_2.0 3864xPCR_Primer_Index_4 ...
34 bases: 379388 avg errors: 0.02 209558xTruSeq_Adapter_Index_3 165637xMultiplexing_PCR_Primer_2.0 ...
35 bases: 217332 avg errors: 0.02 212927xTruSeq_Adapter_Index_4 3930xPCR_Primer_Index_4 ...
36 bases: 171253 avg errors: 0.02 166773xTruSeq_Adapter_Index_4 4119xPCR_Primer_Index_4 ...
37 bases: 170924 avg errors: 0.02 165925xTruSeq_Adapter_Index_4 4781xPCR_Primer_Index_4 ...
38 bases: 192662 avg errors: 0.03 187054xTruSeq_Adapter_Index_4 5435xPCR_Primer_Index_4 ...
39 bases: 219074 avg errors: 0.03 215639xTruSeq_Adapter_Index_4 3308xPCR_Primer_Index_4 ...
40 bases: 324519 avg errors: 0.03 321459xTruSeq_Adapter_Index_4 2845xPCR_Primer_Index_4 ...
41 bases: 425086 avg errors: 0.03 422160xTruSeq_Adapter_Index_4 2732xPCR_Primer_Index_4 ...
<snip>
Nesoni clip: R2 @ 5' end
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 142,354 read-2 adaptors clipped at start
10 bases: 12898 avg errors: 0.97 3080xPE_Read_2_Sequencing_Primer 1837x3p_RNA_Adapter ...
11 bases: 4329 avg errors: 0.94 1123xTruSeq_Adapter_Index_1 940xTruSeq_Universal_Adapter ...
12 bases: 1813 avg errors: 0.42 1045xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 158xTruSeq_Universal_Adapter
13 bases: 3153 avg errors: 0.11 2431xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 569xTruSeq_Universal_Adapte
14 bases: 739 avg errors: 0.25 403xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 213xMultiplexing_PCR_Primer
15 bases: 3652 avg errors: 0.09 2645xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 861xMultiplexing_PCR_Primer
16 bases: 465 avg errors: 0.20 295xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 82xMultiplexing_PCR_Primer_
17 bases: 436 avg errors: 0.21 235xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 118xMultiplexing_PCR_Primer
18 bases: 2689 avg errors: 0.15 2280xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 237xMultiplexing_PCR_
19 bases: 584 avg errors: 0.13 402xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 124xTruSeq_Universal_Adapter
20 bases: 982 avg errors: 0.06 716xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 245xTruSeq_Universal_Adapter
21 bases: 1033 avg errors: 0.06 633xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 375xTruSeq_Universal_Adapter.
22 bases: 1530 avg errors: 0.04 1215xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 312xTruSeq_Universal_Adaptr
23 bases: 1112 avg errors: 0.05 752xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 348xTruSeq_Universal_Adapter
24 bases: 501 avg errors: 0.04 302xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 198xTruSeq_Universal_Adapter
25 bases: 3437 avg errors: 0.02 2140xTruSeq_Universal_Adapter 1297xOligonucleotide_sequences_for_Genomic_DNA_Adapte
26 bases: 1473 avg errors: 0.04 856xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 617xTruSeq_Universal_Adapter
27 bases: 12247 avg errors: 0.01 11399xTruSeq_Universal_Adapter 48xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2
28 bases: 1592 avg errors: 0.04 1282xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 310xTruSeq_Universal_Adapter
29 bases: 3129 avg errors: 0.03 1833xOligonucleotide_sequences_for_Genomic_DNA_Adapters_21296xTruSeq_Universal_Adap
30 bases: 2842 avg errors: 0.04 2584xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 258xTruSeq_Universal_Adapter
31 bases: 1533 avg errors: 0.04 1082xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 448xTruSeq_Universal_Adapt
32 bases: 2822 avg errors: 0.04 2398xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 419xTruSeq_Universal_Adapt
33 bases: 12425 avg errors: 0.03 10171xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2254xTruSeq_Universal_Adapter
34 bases: 277 avg errors: 0.03 277xTruSeq_Universal_Adapter
35 bases: 191 avg errors: 0.04 191xTruSeq_Universal_Adapter
36 bases: 347 avg errors: 0.04 347xTruSeq_Universal_Adapter
37 bases: 1696 avg errors: 0.03 1696xTruSeq_Universal_Adapter
38 bases: 5037 avg errors: 0.02 5037xTruSeq_Universal_Adapter
39 bases: 441 avg errors: 0.05 441xTruSeq_Universal_Adapter
40 bases: 4079 avg errors: 0.02 4079xTruSeq_Universal_Adapter
Nesoni clip: R2 @ 3' end
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 9,331,287 read-2 adaptors clipped at end
10 bases: 204601 avg errors: 0.03 200835xTruSeq_Universal_Adapter 556xOligonucleotide_sequences_for_G
11 bases: 200731 avg errors: 0.02 199519xTruSeq_Universal_Adapter 414xOligonucleotide_sequences_for_G
12 bases: 212465 avg errors: 0.01 212195xTruSeq_Universal_Adapter 45xTruSeq_Adapter_Index_1 ...
13 bases: 227861 avg errors: 0.01 227757xTruSeq_Universal_Adapter 31xPE_Adapters_1 ...
14 bases: 313533 avg errors: 0.01 312402xTruSeq_Universal_Adapter 830xPCR_Primers_2 ...
15 bases: 334268 avg errors: 0.01 332651xTruSeq_Universal_Adapter 1178xPCR_Primers_2 ...
16 bases: 188961 avg errors: 0.02 187641xTruSeq_Universal_Adapter 1134xPCR_Primers_2 ...
17 bases: 236185 avg errors: 0.01 235350xTruSeq_Universal_Adapter 692xPCR_Primers_2 ...
18 bases: 173398 avg errors: 0.02 173005xTruSeq_Universal_Adapter 342xPCR_Primers_2 ...
19 bases: 227216 avg errors: 0.02 226929xTruSeq_Universal_Adapter 283xPCR_Primers_2 ...
20 bases: 195904 avg errors: 0.01 195904xTruSeq_Universal_Adapter
21 bases: 171553 avg errors: 0.01 171553xTruSeq_Universal_Adapter
22 bases: 176894 avg errors: 0.01 176894xTruSeq_Universal_Adapter
23 bases: 192393 avg errors: 0.02 192393xTruSeq_Universal_Adapter
24 bases: 291821 avg errors: 0.02 291821xTruSeq_Universal_Adapter
25 bases: 201236 avg errors: 0.02 201236xTruSeq_Universal_Adapter
26 bases: 251535 avg errors: 0.02 251535xTruSeq_Universal_Adapter
27 bases: 215207 avg errors: 0.02 215207xTruSeq_Universal_Adapter
28 bases: 216714 avg errors: 0.02 216714xTruSeq_Universal_Adapter
29 bases: 203192 avg errors: 0.03 203192xTruSeq_Universal_Adapter
30 bases: 820871 avg errors: 0.02 820871xTruSeq_Universal_Adapter
31 bases: 158860 avg errors: 0.02 158860xTruSeq_Universal_Adapter
32 bases: 191083 avg errors: 0.02 191083xTruSeq_Universal_Adapter
33 bases: 159133 avg errors: 0.02 159133xTruSeq_Universal_Adapter
34 bases: 146912 avg errors: 0.02 146912xTruSeq_Universal_Adapter
<snip>
Nesoni clip: summary
nesoni 0.92
FASTQ offset seems to be 33
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 22,611,994 read-pairs
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 121,236 read-1 too short after quality clip
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 483,729 read-1 too short after adaptor clip
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 22,007,029 read-1 kept
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 100.000 read-1 average input length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 75.426 read-1 average output length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 673,175 read-2 too short after quality clip
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 419,343 read-2 too short after adaptor clip
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 21,519,476 read-2 kept
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 100.000 read-2 average input length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 77.201 read-2 average output length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 21,147,301 pairs kept after clipping
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 1,231,903 reads kept after clipping
started 21 November 2012 11:02 AM
finished 21 November 2012 11:43 AM
run time 0:40:06
Summary
GARBAGE IN,
GARBAGE OUT

Contenu connexe

Tendances

PCR Primer Design Tool for Sanger Sequencing Confirmation
PCR Primer Design Tool for Sanger Sequencing ConfirmationPCR Primer Design Tool for Sanger Sequencing Confirmation
PCR Primer Design Tool for Sanger Sequencing Confirmation
Thermo Fisher Scientific
 
Pcr primer design english version
Pcr primer design english versionPcr primer design english version
Pcr primer design english version
Mahidol University, Thailand
 
Genotyping a deletion mutation in C. elegans via PCR
Genotyping a deletion mutation in C. elegans via PCRGenotyping a deletion mutation in C. elegans via PCR
Genotyping a deletion mutation in C. elegans via PCR
Tiffany Timbers
 
Protein function prediction
Protein function predictionProtein function prediction
2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issuesDongyan Zhao
 
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
QIAGEN
 
Primer designing for pcr and qpcr and their applications
Primer designing for pcr and qpcr and their applicationsPrimer designing for pcr and qpcr and their applications
Primer designing for pcr and qpcr and their applications
All India Institute of Medical Sciences
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
Karan Veer Singh
 
PCR Primer desining
PCR Primer desiningPCR Primer desining
PCR Primer desining
Karan Veer Singh
 
20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop
External RNA Controls Consortium
 
iMate Protocol Guide version 1.2
iMate Protocol Guide version 1.2iMate Protocol Guide version 1.2
iMate Protocol Guide version 1.2
Shigehiro Kuraku (工樂 樹洋)
 
PCR & PRIMER DESIGN
PCR & PRIMER DESIGNPCR & PRIMER DESIGN
PCR & PRIMER DESIGN
sepidehsaroghi
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
BITS
 
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Integrated DNA Technologies
 
Primerdesign
PrimerdesignPrimerdesign
Primerdesign
Aman Ullah
 
2013 polymerase chain-reaction-shakira sulehri
2013 polymerase chain-reaction-shakira sulehri2013 polymerase chain-reaction-shakira sulehri
2013 polymerase chain-reaction-shakira sulehriShakira Sulehri
 
Primers
PrimersPrimers
Introduction to real-Time Quantitative PCR (qPCR) - Download the slides
Introduction to real-Time Quantitative PCR (qPCR) - Download the slidesIntroduction to real-Time Quantitative PCR (qPCR) - Download the slides
Introduction to real-Time Quantitative PCR (qPCR) - Download the slides
QIAGEN
 
iMate Protocol Guide version 3.0
iMate Protocol Guide version 3.0 iMate Protocol Guide version 3.0
iMate Protocol Guide version 3.0
Shigehiro Kuraku (工樂 樹洋)
 

Tendances (20)

PCR Primer Design Tool for Sanger Sequencing Confirmation
PCR Primer Design Tool for Sanger Sequencing ConfirmationPCR Primer Design Tool for Sanger Sequencing Confirmation
PCR Primer Design Tool for Sanger Sequencing Confirmation
 
Pcr primer design english version
Pcr primer design english versionPcr primer design english version
Pcr primer design english version
 
Genotyping a deletion mutation in C. elegans via PCR
Genotyping a deletion mutation in C. elegans via PCRGenotyping a deletion mutation in C. elegans via PCR
Genotyping a deletion mutation in C. elegans via PCR
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function prediction
 
2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues
 
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi...
 
Primer designing for pcr and qpcr and their applications
Primer designing for pcr and qpcr and their applicationsPrimer designing for pcr and qpcr and their applications
Primer designing for pcr and qpcr and their applications
 
DNA_Services
DNA_ServicesDNA_Services
DNA_Services
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
 
PCR Primer desining
PCR Primer desiningPCR Primer desining
PCR Primer desining
 
20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop
 
iMate Protocol Guide version 1.2
iMate Protocol Guide version 1.2iMate Protocol Guide version 1.2
iMate Protocol Guide version 1.2
 
PCR & PRIMER DESIGN
PCR & PRIMER DESIGNPCR & PRIMER DESIGN
PCR & PRIMER DESIGN
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
 
Primerdesign
PrimerdesignPrimerdesign
Primerdesign
 
2013 polymerase chain-reaction-shakira sulehri
2013 polymerase chain-reaction-shakira sulehri2013 polymerase chain-reaction-shakira sulehri
2013 polymerase chain-reaction-shakira sulehri
 
Primers
PrimersPrimers
Primers
 
Introduction to real-Time Quantitative PCR (qPCR) - Download the slides
Introduction to real-Time Quantitative PCR (qPCR) - Download the slidesIntroduction to real-Time Quantitative PCR (qPCR) - Download the slides
Introduction to real-Time Quantitative PCR (qPCR) - Download the slides
 
iMate Protocol Guide version 3.0
iMate Protocol Guide version 3.0 iMate Protocol Guide version 3.0
iMate Protocol Guide version 3.0
 

En vedette

Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012
Torsten Seemann
 
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Torsten Seemann
 
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
Torsten Seemann
 
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
Torsten Seemann
 
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Torsten Seemann
 
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Torsten Seemann
 
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Torsten Seemann
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
Torsten Seemann
 
Mousegenomes tk-wtsi (1)
Mousegenomes tk-wtsi (1)Mousegenomes tk-wtsi (1)
Mousegenomes tk-wtsi (1)
Thomas Keane
 
Assessing the impact of transposable element variation on mouse phenotypes an...
Assessing the impact of transposable element variation on mouse phenotypes an...Assessing the impact of transposable element variation on mouse phenotypes an...
Assessing the impact of transposable element variation on mouse phenotypes an...Thomas Keane
 
AMR surveillance in Europe: historical background and future outlook. Hajo G...
AMR surveillance in Europe: historical background and future outlook.  Hajo G...AMR surveillance in Europe: historical background and future outlook.  Hajo G...
AMR surveillance in Europe: historical background and future outlook. Hajo G...
European Centre for Disease Prevention and Control (ECDC)
 
Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015
Torsten Seemann
 
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Torsten Seemann
 
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
Torsten Seemann
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
Torsten Seemann
 
Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013
Torsten Seemann
 
How to write bioinformatics software people will use and cite - t.seemann - ...
How to write bioinformatics software people will use and cite -  t.seemann - ...How to write bioinformatics software people will use and cite -  t.seemann - ...
How to write bioinformatics software people will use and cite - t.seemann - ...
Torsten Seemann
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotations
Thomas Keane
 
Mouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-EditingMouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-EditingThomas Keane
 
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Игорь Шадеркин
 

En vedette (20)

Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012
 
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
 
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
 
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
 
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
 
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
 
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
 
Mousegenomes tk-wtsi (1)
Mousegenomes tk-wtsi (1)Mousegenomes tk-wtsi (1)
Mousegenomes tk-wtsi (1)
 
Assessing the impact of transposable element variation on mouse phenotypes an...
Assessing the impact of transposable element variation on mouse phenotypes an...Assessing the impact of transposable element variation on mouse phenotypes an...
Assessing the impact of transposable element variation on mouse phenotypes an...
 
AMR surveillance in Europe: historical background and future outlook. Hajo G...
AMR surveillance in Europe: historical background and future outlook.  Hajo G...AMR surveillance in Europe: historical background and future outlook.  Hajo G...
AMR surveillance in Europe: historical background and future outlook. Hajo G...
 
Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015
 
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
 
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
 
Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013
 
How to write bioinformatics software people will use and cite - t.seemann - ...
How to write bioinformatics software people will use and cite -  t.seemann - ...How to write bioinformatics software people will use and cite -  t.seemann - ...
How to write bioinformatics software people will use and cite - t.seemann - ...
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotations
 
Mouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-EditingMouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-Editing
 
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
 

Similaire à Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012

RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2
BITS
 
137920
137920137920
Convolutional and Recurrent Neural Networks
Convolutional and Recurrent Neural NetworksConvolutional and Recurrent Neural Networks
Convolutional and Recurrent Neural Networks
Ramesh Ragala
 
Assignment 1-mtat
Assignment 1-mtatAssignment 1-mtat
Assignment 1-mtatzafargilani
 
ppt.polymerase_chain_reaction_pcr (1).pdf
ppt.polymerase_chain_reaction_pcr (1).pdfppt.polymerase_chain_reaction_pcr (1).pdf
ppt.polymerase_chain_reaction_pcr (1).pdf
SonuSharma887555
 
PCR.pptx
PCR.pptxPCR.pptx
PCR.pptx
ProsperBonuIre
 
pcr BY ME.docx
pcr BY ME.docxpcr BY ME.docx
pcr BY ME.docx
Pandey141
 
3 pcr
3 pcr3 pcr
qRT-PCR.pdf
qRT-PCR.pdfqRT-PCR.pdf
qRT-PCR.pdf
ShadenAlharbi
 
Polymerase Chain Reaction ( PCR )
Polymerase Chain Reaction ( PCR ) Polymerase Chain Reaction ( PCR )
Polymerase Chain Reaction ( PCR )
Amany Elsayed
 
OpenFOAM benchmark for EPYC server -Influence of coarsestLevelCorr in GAMG so...
OpenFOAM benchmark for EPYC server -Influence of coarsestLevelCorr in GAMG so...OpenFOAM benchmark for EPYC server -Influence of coarsestLevelCorr in GAMG so...
OpenFOAM benchmark for EPYC server -Influence of coarsestLevelCorr in GAMG so...
takuyayamamoto1800
 
RNase H2 PCR—A New Technology to Reduce Primer Dimers and Increase Genotyping...
RNase H2 PCR—A New Technology to Reduce Primer Dimers and Increase Genotyping...RNase H2 PCR—A New Technology to Reduce Primer Dimers and Increase Genotyping...
RNase H2 PCR—A New Technology to Reduce Primer Dimers and Increase Genotyping...
Integrated DNA Technologies
 
Blast fasta 4
Blast fasta 4Blast fasta 4
Blast fasta 4
Er Puspendra Tripathi
 
Primer design
Primer designPrimer design
Primer design
Sabahat Ali
 
Surrey dl-4
Surrey dl-4Surrey dl-4
Surrey dl-4ozzie73
 
Using neon for pattern recognition in audio data
Using neon for pattern recognition in audio dataUsing neon for pattern recognition in audio data
Using neon for pattern recognition in audio data
Intel Nervana
 
Parallel Random Generator - GDC 2015
Parallel Random Generator - GDC 2015Parallel Random Generator - GDC 2015
Parallel Random Generator - GDC 2015
Manchor Ko
 
ParallelRandom-mannyko
ParallelRandom-mannykoParallelRandom-mannyko
ParallelRandom-mannykoManchor Ko
 
2.2 analyzing and manipulating dna
2.2 analyzing and manipulating dna2.2 analyzing and manipulating dna
2.2 analyzing and manipulating dna
Emmanuel Aguon
 
Lattice Cryptography
Lattice CryptographyLattice Cryptography
Lattice Cryptography
Priyanka Aash
 

Similaire à Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012 (20)

RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2
 
137920
137920137920
137920
 
Convolutional and Recurrent Neural Networks
Convolutional and Recurrent Neural NetworksConvolutional and Recurrent Neural Networks
Convolutional and Recurrent Neural Networks
 
Assignment 1-mtat
Assignment 1-mtatAssignment 1-mtat
Assignment 1-mtat
 
ppt.polymerase_chain_reaction_pcr (1).pdf
ppt.polymerase_chain_reaction_pcr (1).pdfppt.polymerase_chain_reaction_pcr (1).pdf
ppt.polymerase_chain_reaction_pcr (1).pdf
 
PCR.pptx
PCR.pptxPCR.pptx
PCR.pptx
 
pcr BY ME.docx
pcr BY ME.docxpcr BY ME.docx
pcr BY ME.docx
 
3 pcr
3 pcr3 pcr
3 pcr
 
qRT-PCR.pdf
qRT-PCR.pdfqRT-PCR.pdf
qRT-PCR.pdf
 
Polymerase Chain Reaction ( PCR )
Polymerase Chain Reaction ( PCR ) Polymerase Chain Reaction ( PCR )
Polymerase Chain Reaction ( PCR )
 
OpenFOAM benchmark for EPYC server -Influence of coarsestLevelCorr in GAMG so...
OpenFOAM benchmark for EPYC server -Influence of coarsestLevelCorr in GAMG so...OpenFOAM benchmark for EPYC server -Influence of coarsestLevelCorr in GAMG so...
OpenFOAM benchmark for EPYC server -Influence of coarsestLevelCorr in GAMG so...
 
RNase H2 PCR—A New Technology to Reduce Primer Dimers and Increase Genotyping...
RNase H2 PCR—A New Technology to Reduce Primer Dimers and Increase Genotyping...RNase H2 PCR—A New Technology to Reduce Primer Dimers and Increase Genotyping...
RNase H2 PCR—A New Technology to Reduce Primer Dimers and Increase Genotyping...
 
Blast fasta 4
Blast fasta 4Blast fasta 4
Blast fasta 4
 
Primer design
Primer designPrimer design
Primer design
 
Surrey dl-4
Surrey dl-4Surrey dl-4
Surrey dl-4
 
Using neon for pattern recognition in audio data
Using neon for pattern recognition in audio dataUsing neon for pattern recognition in audio data
Using neon for pattern recognition in audio data
 
Parallel Random Generator - GDC 2015
Parallel Random Generator - GDC 2015Parallel Random Generator - GDC 2015
Parallel Random Generator - GDC 2015
 
ParallelRandom-mannyko
ParallelRandom-mannykoParallelRandom-mannyko
ParallelRandom-mannyko
 
2.2 analyzing and manipulating dna
2.2 analyzing and manipulating dna2.2 analyzing and manipulating dna
2.2 analyzing and manipulating dna
 
Lattice Cryptography
Lattice CryptographyLattice Cryptography
Lattice Cryptography
 

Plus de Torsten Seemann

How to write bioinformatics software no one will use
How to write bioinformatics software no one will useHow to write bioinformatics software no one will use
How to write bioinformatics software no one will use
Torsten Seemann
 
Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016
Torsten Seemann
 
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Torsten Seemann
 
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
Torsten Seemann
 
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
Torsten Seemann
 
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Torsten Seemann
 

Plus de Torsten Seemann (6)

How to write bioinformatics software no one will use
How to write bioinformatics software no one will useHow to write bioinformatics software no one will use
How to write bioinformatics software no one will use
 
Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016
 
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
 
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
 
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
 
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
 

Dernier

Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
HongcNguyn6
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdfThe Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
mediapraxi
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
TinyAnderson
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
RASHMI M G
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 

Dernier (20)

Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdfThe Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 

Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012

  • 1. Cleaning Illumina reads Torsten Seemann VLSCI :: LSCC - Lab Meeting - Fri 23 Nov 2012
  • 3. Illumina reads ● Usually 100 or 150 bp ○ 250bp rolling out now, 400bp next year ● Indel errors rare ● Homopolymer errors rare ● Substitution errors < 1% ○ Error rate higher at 3' end ● Adaptor issues ○ rare in HiSeq (TruSeq prep) ○ more common in MiSeq (Nextera prep) ● Very high quality overall
  • 4. Illumina libraries ●"single end" ○just a shotgun read sequenced from one end ●"paired end" ○~500bp fragments sequenced at both ends ○very reliable ●"mate pair" ○circularized 2-10 kbp fragments sequencing ○then "paired end" protocol ○reliability varies
  • 5. ● garbage reads ○ instrument weirdness ● duplicate reads ○ low complexity library, PCR duplicate ● adaptor read-through ○ fragment too short ● indel errors ○ skipping bases, inserting extra bases ● uncalled base ○ couldn't reliably estimate, replace with "N" ● substitution errors ○ reading wrong base Sequences have errors More common Less common
  • 6. Why clean reads? ● Erroneous data may cause software to: ○ run more slowly ○ use more RAM ○ produce poor / biased / incorrect results ● Cleaning can: ○ improve overall average quality of the reads ■ hopefully giving a reliable result ○ reduce the volume of reads ■ some algorithms are O(N.logN) or O(N2 ) ■ enable processing when otherwise couldn't ● (some software does handle them appropriately)
  • 7. The Q in FASTQ
  • 8. DNA sequence quality ● DNA sequences often have a quality value associated with each nucleotide ● A measure of reliability for each base ○ as it is derived from physical process ■ chromatogram (Sanger sequencing) ■ pH reading (Ion Torrent sequencing) ● Formalised by the Phred software for the Human Genome Project
  • 9. Phred qualities ● Q = -10 log10 P <=> P = 10- Q / 10 ○ Q = Phred quality score ○ P = probability of base call being incorrect Quality value Chance it is wrong Accuracy 10 1 in 10 90% 20 1 in 100 99% 30 1 in 1000 99.9% 40 1 in 10,000 99.99% 50 1 in 100,000 99.999%
  • 10. Illumina quality plot Y-axis is "Phred" quality values (higher is better)
  • 12. Anatomy of a FASTQ entry @read00179 AGTCTGATATGCTGTACCTATTATAATTCTAGGCGCTCAT GCCCGCGGATATCGTAGCTATATGCTTCA + 8;ACCCD?DD???@B9<9<CAC@=AAAA8A;B<A@882,+ 495;;3990,02..-&-&-*,,,,(0**# Start symbol Sequence ID Sequence Separator line Encoded quality values, one symbol per nucleotide
  • 13. FASTQ quality encoding !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ | | | | | Q0 Q10 Q20 Q30 Q40 bad maybe ok good excellent Uses letters/symbols to represent numbers: Mnemonic: "swear" words! Mnemonic: Q20 = '2' and '0' Mnemonic: 'HI' = high Q
  • 14. Spoiled for choice Solexa !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefgh ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | | | | | 33 59 64 73 104 Illumina 1.3 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefgh ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII | | | | | 33 59 64 73 104 Illumina 1.5 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefgh .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ | | | | | 33 59 64 73 104 Illumina 1.8 (Sanger) !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefgh SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS............................... | | | | | 33 59 64 73 104
  • 15. What can we clean?
  • 16. Ambiguous bases ● If there is ambiguity in the base call, an "N" is used @ILLUMINA:6:1:964:115#GATCAG/1 GGACCTGAGAGTGTGCATGAAGAGGGCAGCGCGCACNGCATCA + HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"! ● Possible software responses: ○ Crash! ○ Ignore it ○ Silently convert to fixed or random base (Velvet > 'A') ○ Handle it appropriately (Bowtie2) ● Small proportion overall, safer to discard
  • 17. Homopolymers ● A read consisting of all the same base @ILLUMINA:6:1:964:115#GATCAG/1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"! ● Often occur from clusters at edge of flowcell lane ● Early Illumina software called '?' as 'A' rather than N ● Unlikely to be present in real DNA ● Best to discard ● Less common with newer Illumina OLB software
  • 18. Quality trimming ●Remove low quality sequence ○Q=13 corresponds to 5% error (p=0.05) ○Q=0..13 encoded by: !"#@%&'()*+,-/ @ILLUMINA:6:1:9646:1115#GATCAG/1 GGACCTGAGAGTGTGCATGAAGAGGGCAGCCCCGCACTGCATG + HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"! ●Can trim per ○each base ○window moving average eg. 3 base mean ○minimum % good per window eg. need 4 of 5
  • 19. Illumina Adaptors ● Used in the sequencing chemistry ● Can appear at ends of read sequences ● Worse for mate-pair than for paired-end reads ● PCR Primer CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT ● Genomic DNA Sequencing Primer CACTCTTTCCCTACACGACGCTCTTCCGATCT ● All other TruSeq & Nextera Adaptors NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
  • 20. Adaptor clipping ● Method ○ Align 3' and 5' read end against all adaptor sequences ○ If there is an anchored "match", trim the read ● Minimum length of match? ○ want to remove adaptor, but not real sequence [10 bp] ● Allow substitutions in match? ○ as reads have errors, need some tolerance [1 sub] ● Allow gaps/indels in match? ○ indels are unlikely in Illumina reads [no] ● Slow to perform compared to other preprocessing steps
  • 21. Decloning ●Illumina "mate pair" sequencing ○Requires a lot of starting DNA ○Challenging protocol to implement reliably ○Not enough final DNA leads to PCR duplicates ○Coverage is highly non-uniform and sporadic ○Causes bias in analyses ●Decloning ○Replace clones with a single representative ○Choose representative with highest quality ○Helps salvage usable information content
  • 22. Read length ●Enforce a minimum read length L ●k-mer based tools ○break reads into k-mers, so L < k is pointless ○Velvet, Trinity, Gossamer, ... ●Uniqueness ○desire reasonable uniqueness of sequence, otherwise multiple mapping will take forever! ○L=24 is bare minimum (I reckon) ○BWA, Bowtie, Shrimp, ...
  • 23. Other strategies ●Digital normalisation ○remove low frequency k-mers ○remove reads w/ too many low freq k-mers ●Error correction ○replace low frequency k-mers with their "closest" high-frequency k-mer ○other methods I don't understand yet ●Just use local alignment to "solve" the problem
  • 25. Walk-through 1. Original read + quality =43bp GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA hhhhghfeefaa^a^[[[^X[[XX^^^^`SSTQPZZBBBBBBB 2. Homopolymer? No GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA 3. Ambiguious N bases? Yes, 1 GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA 4. Quality < 20 ? Yes, at 3' end GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA 5. Adaptor sequences > 8bp ? Yes, 9 bp at 5' end GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA 6. Combine all masks Logical intersection GTTAGCGCGCTGACCATGATTCAAGGAACTGGCCCCATTNATA 7. Extract longest sub-sequence =19bp TGACCATGATTCAAGGAAC
  • 26. The coral genome ● Raw data (A.millepora Illumina) ○ 9 libraries - 3 x PE, 6 x MP - 200bp to 10kbp ○ 92.0 Gbp, 943M reads, average length 98bp ● Method ○ Decloned all MP libs, disallow Ns, reject homopolymers, trim Q < 20 + clip adaptors, minimum length 55bp ● Cleaned data ○ 42.5 Gbp, 478M reads, average length 88bp ● Effect ○ Good - de novo Velvet assembly improved overall ■ validated by RNA-Seq, ESTs ○ Bad - lower coverage ■ coral is very heterozygous > 10%
  • 27. Per library yields (Gbp) ●Library Raw Cleaned %Kept ●pe_193 9.55 6.71 70 ●pe_463 19.19 13.89 72 ●pe_580 4.87 3.18 65 ●mp_2200 18.48 8.48 46 ●mp_2820 13.54 1.54 11 * ●mp_4628 12.95 0.85 6 * ●mp_5000 6.92 2.98 43 ●mp_8000 4.33 1.64 38 ●mp_10000 2.15 0.25 11 * ●single n/a 3.00 n/a ●TOTAL 92.00 42.50
  • 28. The GKP FFPE BWA PE mystery ●FFPE sample ○formalin-fixed, paraffin embedded ○long term tissue archival storage method ●Sequencing ○HiSeq-2000 ○22 million x 100bp paired-end reads ○quality looks good ... but not mapping!
  • 30. FastQC is giving us some hints...
  • 32. Nesoni ●Implemented primarily by Paul Harrison @ VBC ○swiss army knife ○snp calling, phylogenetics, DGE, .... ○extendible pipeline system (Python) ●We used the "nesoni clip:" module ○adaptor clipping = on ○min quality = Q10 ○allow Ns = no ○min length = 2
  • 33. Nesoni command line nesoni clip: --quality 10 --length 24 --out-separate yes SM_AdMP1_ID_07B26948_L004_CLIPPED pairs: SM_AdMP1_ID_07B26948_L004_R1.fastq.gz SM_AdMP1_ID_07B26948_L004_R2.fastq.gz SM_AdMP1_ID_07B26948_L004_CLIPPED_R1.fq.gz SM_AdMP1_ID_07B26948_L004_CLIPPED_R2.fq.gz SM_AdMP1_ID_07B26948_L004_CLIPPED_single.fq.gz Orphans
  • 34. Nesoni clip: R1 @ 5' end (> SM_AdMP1_ID_07B26948_L004_CLIPPED 161,553 read-1 adaptors clipped at start 10 bases: 11757 avg errors: 0.95 3948xPE_Read_2_Sequencing_Primer 1260xMultiplexing_Adapters_1 ... 11 bases: 3212 avg errors: 0.74 758xMultiplexing_PCR_Primer_2.0 269xPE_Read_2_Sequencing_Primer ... 12 bases: 1739 avg errors: 0.62 545xTruSeq_Adapter_Index_1 453xMultiplexing_PCR_Primer_2.0 ... 13 bases: 3054 avg errors: 0.38 2377xTruSeq_Universal_Adapter 429xMultiplexing_PCR_Primer_2.0 ... 14 bases: 760 avg errors: 0.34 357xMultiplexing_PCR_Primer_2.0 163xMultiplexing_Adapters_1 ... 15 bases: 2062 avg errors: 0.13 1558xMultiplexing_PCR_Primer_2.0 261xMultiplexing_Adapters_1 ... 16 bases: 967 avg errors: 0.15 575xMultiplexing_PCR_Primer_2.0 267xMultiplexing_Adapters_1 ... 17 bases: 1271 avg errors: 0.15 603xMultiplexing_PCR_Primer_2.0 419xMultiplexing_Adapters_1 ... 18 bases: 2026 avg errors: 0.23 1095xMultiplexing_PCR_Primer_2.0 483xMultiplexing_Adapters_1 ... 19 bases: 2494 avg errors: 0.07 1956xMultiplexing_PCR_Primer_2.0 422xMultiplexing_Adapters_1 ... 20 bases: 2791 avg errors: 0.07 1481xMultiplexing_Adapters_1 1286xMultiplexing_PCR_Primer_2.0 ... 21 bases: 3946 avg errors: 0.05 3814xMultiplexing_PCR_Primer_2.0 123x3p_RNA_Adapter ... 22 bases: 2238 avg errors: 0.16 1742xMultiplexing_PCR_Primer_2.0 318x3p_RNA_Adapter ... 23 bases: 4207 avg errors: 0.02 4162xMultiplexing_PCR_Primer_2.0 43xTruSeq_Adapter_Index_1 ... 24 bases: 2127 avg errors: 0.03 2086xMultiplexing_PCR_Primer_2.0 33xTruSeq_Adapter_Index_1 ... 25 bases: 1630 avg errors: 0.04 1612xMultiplexing_PCR_Primer_2.0 18xTruSeq_Adapter_Index_4 26 bases: 4031 avg errors: 0.02 4000xMultiplexing_PCR_Primer_2.0 28xTruSeq_Adapter_Index_4 ... 27 bases: 1736 avg errors: 0.04 1626xMultiplexing_PCR_Primer_2.0 110xTruSeq_Adapter_Index_4 28 bases: 3140 avg errors: 0.04 3059xMultiplexing_PCR_Primer_2.0 80xTruSeq_Adapter_Index_4 ... 29 bases: 3277 avg errors: 0.02 3228xMultiplexing_PCR_Primer_2.0 49xTruSeq_Adapter_Index_4 30 bases: 4578 avg errors: 0.03 4542xMultiplexing_PCR_Primer_2.0 36xTruSeq_Adapter_Index_4 31 bases: 4459 avg errors: 0.04 4309xMultiplexing_PCR_Primer_2.0 148xTruSeq_Adapter_Index_4 ... 32 bases: 2753 avg errors: 0.07 2577xMultiplexing_PCR_Primer_2.0 174xTruSeq_Adapter_Index_4 ... 33 bases: 9794 avg errors: 0.05 9255xMultiplexing_PCR_Primer_2.0 536xTruSeq_Adapter_Index_4 ... 34 bases: 63909 avg errors: 0.03 63779xMultiplexing_PCR_Primer_2.0 127xTruSeq_Adapter_Index_4 ... <snip>
  • 35. Nesoni clip: R1 @ 3' end (> SM_AdMP1_ID_07B26948_L004_CLIPPED 9,934,544 read-1 adaptors clipped at end 10 bases: 243222 avg errors: 0.04 224121xTruSeq_Universal_Adapter 11677xPCR_Primer_Index_1 ... 11 bases: 209946 avg errors: 0.02 198663xTruSeq_Universal_Adapter 7666xPCR_Primer_Index_1 ... 12 bases: 287146 avg errors: 0.02 277756xTruSeq_Universal_Adapter 6368xPCR_Primer_Index_1 ... 13 bases: 233067 avg errors: 0.02 222693xTruSeq_Universal_Adapter 7252xPCR_Primer_Index_1 ... 14 bases: 305000 avg errors: 0.02 295013xMultiplexing_PCR_Primer_2.0 7554xPCR_Primer_Index_3 ... 15 bases: 338908 avg errors: 0.02 328225xMultiplexing_PCR_Primer_2.0 8308xPCR_Primer_Index_4 ... 16 bases: 183745 avg errors: 0.02 174275xMultiplexing_PCR_Primer_2.0 7233xPCR_Primer_Index_4 ... 17 bases: 169356 avg errors: 0.02 160221xMultiplexing_PCR_Primer_2.0 6860xPCR_Primer_Index_4 ... 18 bases: 209386 avg errors: 0.02 197727xMultiplexing_PCR_Primer_2.0 7498xPCR_Primer_Index_4 ... 19 bases: 171360 avg errors: 0.02 157072xMultiplexing_PCR_Primer_2.0 8753xPCR_Primer_Index_4 ... 20 bases: 190699 avg errors: 0.02 169311xMultiplexing_PCR_Primer_2.0 18186xPCR_Primer_Index_4 ... 21 bases: 195454 avg errors: 0.02 169850xMultiplexing_PCR_Primer_2.0 22795xPCR_Primer_Index_4 ... 22 bases: 252124 avg errors: 0.21 194698xMultiplexing_PCR_Primer_2.0 47983x3p_RNA_Adapter ... 23 bases: 194228 avg errors: 0.02 183170xMultiplexing_PCR_Primer_2.0 5698xv1.5_Small_RNA_3p_Adapter ... 24 bases: 189333 avg errors: 0.02 178797xMultiplexing_PCR_Primer_2.0 7050xPCR_Primer_Index_4 ... 25 bases: 199061 avg errors: 0.02 190980xMultiplexing_PCR_Primer_2.0 7932xPCR_Primer_Index_4 ... 26 bases: 241032 avg errors: 0.02 235358xMultiplexing_PCR_Primer_2.0 5504xPCR_Primer_Index_4 ... 27 bases: 212056 avg errors: 0.02 206622xMultiplexing_PCR_Primer_2.0 5226xPCR_Primer_Index_4 ... 28 bases: 210022 avg errors: 0.03 203762xMultiplexing_PCR_Primer_2.0 6081xPCR_Primer_Index_4 ... 29 bases: 188424 avg errors: 0.02 183273xMultiplexing_PCR_Primer_2.0 4954xPCR_Primer_Index_4 ... 30 bases: 344312 avg errors: 0.02 339452xMultiplexing_PCR_Primer_2.0 4700xPCR_Primer_Index_4 ... 31 bases: 193023 avg errors: 0.02 187439xMultiplexing_PCR_Primer_2.0 5426xPCR_Primer_Index_4 ... 32 bases: 187459 avg errors: 0.03 180062xMultiplexing_PCR_Primer_2.0 7275xPCR_Primer_Index_4 ... 33 bases: 184063 avg errors: 0.02 180029xMultiplexing_PCR_Primer_2.0 3864xPCR_Primer_Index_4 ... 34 bases: 379388 avg errors: 0.02 209558xTruSeq_Adapter_Index_3 165637xMultiplexing_PCR_Primer_2.0 ... 35 bases: 217332 avg errors: 0.02 212927xTruSeq_Adapter_Index_4 3930xPCR_Primer_Index_4 ... 36 bases: 171253 avg errors: 0.02 166773xTruSeq_Adapter_Index_4 4119xPCR_Primer_Index_4 ... 37 bases: 170924 avg errors: 0.02 165925xTruSeq_Adapter_Index_4 4781xPCR_Primer_Index_4 ... 38 bases: 192662 avg errors: 0.03 187054xTruSeq_Adapter_Index_4 5435xPCR_Primer_Index_4 ... 39 bases: 219074 avg errors: 0.03 215639xTruSeq_Adapter_Index_4 3308xPCR_Primer_Index_4 ... 40 bases: 324519 avg errors: 0.03 321459xTruSeq_Adapter_Index_4 2845xPCR_Primer_Index_4 ... 41 bases: 425086 avg errors: 0.03 422160xTruSeq_Adapter_Index_4 2732xPCR_Primer_Index_4 ... <snip>
  • 36. Nesoni clip: R2 @ 5' end (> SM_AdMP1_ID_07B26948_L004_CLIPPED 142,354 read-2 adaptors clipped at start 10 bases: 12898 avg errors: 0.97 3080xPE_Read_2_Sequencing_Primer 1837x3p_RNA_Adapter ... 11 bases: 4329 avg errors: 0.94 1123xTruSeq_Adapter_Index_1 940xTruSeq_Universal_Adapter ... 12 bases: 1813 avg errors: 0.42 1045xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 158xTruSeq_Universal_Adapter 13 bases: 3153 avg errors: 0.11 2431xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 569xTruSeq_Universal_Adapte 14 bases: 739 avg errors: 0.25 403xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 213xMultiplexing_PCR_Primer 15 bases: 3652 avg errors: 0.09 2645xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 861xMultiplexing_PCR_Primer 16 bases: 465 avg errors: 0.20 295xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 82xMultiplexing_PCR_Primer_ 17 bases: 436 avg errors: 0.21 235xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 118xMultiplexing_PCR_Primer 18 bases: 2689 avg errors: 0.15 2280xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 237xMultiplexing_PCR_ 19 bases: 584 avg errors: 0.13 402xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 124xTruSeq_Universal_Adapter 20 bases: 982 avg errors: 0.06 716xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 245xTruSeq_Universal_Adapter 21 bases: 1033 avg errors: 0.06 633xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 375xTruSeq_Universal_Adapter. 22 bases: 1530 avg errors: 0.04 1215xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 312xTruSeq_Universal_Adaptr 23 bases: 1112 avg errors: 0.05 752xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 348xTruSeq_Universal_Adapter 24 bases: 501 avg errors: 0.04 302xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 198xTruSeq_Universal_Adapter 25 bases: 3437 avg errors: 0.02 2140xTruSeq_Universal_Adapter 1297xOligonucleotide_sequences_for_Genomic_DNA_Adapte 26 bases: 1473 avg errors: 0.04 856xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 617xTruSeq_Universal_Adapter 27 bases: 12247 avg errors: 0.01 11399xTruSeq_Universal_Adapter 48xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 28 bases: 1592 avg errors: 0.04 1282xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 310xTruSeq_Universal_Adapter 29 bases: 3129 avg errors: 0.03 1833xOligonucleotide_sequences_for_Genomic_DNA_Adapters_21296xTruSeq_Universal_Adap 30 bases: 2842 avg errors: 0.04 2584xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 258xTruSeq_Universal_Adapter 31 bases: 1533 avg errors: 0.04 1082xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 448xTruSeq_Universal_Adapt 32 bases: 2822 avg errors: 0.04 2398xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2 419xTruSeq_Universal_Adapt 33 bases: 12425 avg errors: 0.03 10171xOligonucleotide_sequences_for_Genomic_DNA_Adapters_2254xTruSeq_Universal_Adapter 34 bases: 277 avg errors: 0.03 277xTruSeq_Universal_Adapter 35 bases: 191 avg errors: 0.04 191xTruSeq_Universal_Adapter 36 bases: 347 avg errors: 0.04 347xTruSeq_Universal_Adapter 37 bases: 1696 avg errors: 0.03 1696xTruSeq_Universal_Adapter 38 bases: 5037 avg errors: 0.02 5037xTruSeq_Universal_Adapter 39 bases: 441 avg errors: 0.05 441xTruSeq_Universal_Adapter 40 bases: 4079 avg errors: 0.02 4079xTruSeq_Universal_Adapter
  • 37. Nesoni clip: R2 @ 3' end (> SM_AdMP1_ID_07B26948_L004_CLIPPED 9,331,287 read-2 adaptors clipped at end 10 bases: 204601 avg errors: 0.03 200835xTruSeq_Universal_Adapter 556xOligonucleotide_sequences_for_G 11 bases: 200731 avg errors: 0.02 199519xTruSeq_Universal_Adapter 414xOligonucleotide_sequences_for_G 12 bases: 212465 avg errors: 0.01 212195xTruSeq_Universal_Adapter 45xTruSeq_Adapter_Index_1 ... 13 bases: 227861 avg errors: 0.01 227757xTruSeq_Universal_Adapter 31xPE_Adapters_1 ... 14 bases: 313533 avg errors: 0.01 312402xTruSeq_Universal_Adapter 830xPCR_Primers_2 ... 15 bases: 334268 avg errors: 0.01 332651xTruSeq_Universal_Adapter 1178xPCR_Primers_2 ... 16 bases: 188961 avg errors: 0.02 187641xTruSeq_Universal_Adapter 1134xPCR_Primers_2 ... 17 bases: 236185 avg errors: 0.01 235350xTruSeq_Universal_Adapter 692xPCR_Primers_2 ... 18 bases: 173398 avg errors: 0.02 173005xTruSeq_Universal_Adapter 342xPCR_Primers_2 ... 19 bases: 227216 avg errors: 0.02 226929xTruSeq_Universal_Adapter 283xPCR_Primers_2 ... 20 bases: 195904 avg errors: 0.01 195904xTruSeq_Universal_Adapter 21 bases: 171553 avg errors: 0.01 171553xTruSeq_Universal_Adapter 22 bases: 176894 avg errors: 0.01 176894xTruSeq_Universal_Adapter 23 bases: 192393 avg errors: 0.02 192393xTruSeq_Universal_Adapter 24 bases: 291821 avg errors: 0.02 291821xTruSeq_Universal_Adapter 25 bases: 201236 avg errors: 0.02 201236xTruSeq_Universal_Adapter 26 bases: 251535 avg errors: 0.02 251535xTruSeq_Universal_Adapter 27 bases: 215207 avg errors: 0.02 215207xTruSeq_Universal_Adapter 28 bases: 216714 avg errors: 0.02 216714xTruSeq_Universal_Adapter 29 bases: 203192 avg errors: 0.03 203192xTruSeq_Universal_Adapter 30 bases: 820871 avg errors: 0.02 820871xTruSeq_Universal_Adapter 31 bases: 158860 avg errors: 0.02 158860xTruSeq_Universal_Adapter 32 bases: 191083 avg errors: 0.02 191083xTruSeq_Universal_Adapter 33 bases: 159133 avg errors: 0.02 159133xTruSeq_Universal_Adapter 34 bases: 146912 avg errors: 0.02 146912xTruSeq_Universal_Adapter <snip>
  • 38. Nesoni clip: summary nesoni 0.92 FASTQ offset seems to be 33 (> SM_AdMP1_ID_07B26948_L004_CLIPPED 22,611,994 read-pairs (> SM_AdMP1_ID_07B26948_L004_CLIPPED 121,236 read-1 too short after quality clip (> SM_AdMP1_ID_07B26948_L004_CLIPPED 483,729 read-1 too short after adaptor clip (> SM_AdMP1_ID_07B26948_L004_CLIPPED 22,007,029 read-1 kept (> SM_AdMP1_ID_07B26948_L004_CLIPPED 100.000 read-1 average input length (> SM_AdMP1_ID_07B26948_L004_CLIPPED 75.426 read-1 average output length (> SM_AdMP1_ID_07B26948_L004_CLIPPED 673,175 read-2 too short after quality clip (> SM_AdMP1_ID_07B26948_L004_CLIPPED 419,343 read-2 too short after adaptor clip (> SM_AdMP1_ID_07B26948_L004_CLIPPED 21,519,476 read-2 kept (> SM_AdMP1_ID_07B26948_L004_CLIPPED 100.000 read-2 average input length (> SM_AdMP1_ID_07B26948_L004_CLIPPED 77.201 read-2 average output length (> SM_AdMP1_ID_07B26948_L004_CLIPPED 21,147,301 pairs kept after clipping (> SM_AdMP1_ID_07B26948_L004_CLIPPED 1,231,903 reads kept after clipping started 21 November 2012 11:02 AM finished 21 November 2012 11:43 AM run time 0:40:06