1. Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY
ss2489@cornell.edu // @SahaSurya
BTI PGRP Summer Internship Program 2014
http://www.acgt.me/blog/2014/3/7/next-generation-sequencing-must-die
2. Why Sequencing?
• Targeted interrogation of genome
• Economical
• Technological developments
• High-throughput assays
• But requires subsequent validation
7/8/2014 BTI PGRP Summer Internship Program 2014 2
3. Sequencing over the Ages
Timeline (figure):
• 1953: DNA structure discovery
• 1977: Sanger DNA sequencing by chain-terminating inhibitors
• 1984: Epstein-Barr virus (170 Kb)
• 1987: ABI 370 sequencer
• 1995: Haemophilus influenzae (1.83 Mb)
• 2001: Homo sapiens (3.0 Gb)
• 2005 to 2013 (year marks at 2005, 2007, 2011, 2012, 2013): 454, Solexa, SOLiD, Illumina, Ion Torrent, PacBio, Illumina HiSeq X; Pinus taeda (24 Gb)
Slide credit: Aureliano Bombarely
5. Sanger method
Frederick Sanger
13 Aug 1918 – 19 Nov 2013
Won the Nobel Prize for Chemistry in 1958 and 1980. Published the dideoxy chain termination method, or “Sanger method”, in 1977.
http://dailym.ai/1f1XeTB
7. First generation sequencing
• Very high quality sequences (99.999%)
• Very low throughput
Capillary Sequencing (ABI3730xl):
• Run time: 20 min to 3 h
• Read length: 400-900 bp
• Reads / run: 96 or 386
• Total nucleotides sequenced: 1.9-84 Kb
• Cost / Mb: $2400
http://bit.ly/1clLps3
http://1.usa.gov/1cLqIRd
8. Name the specific technology used to generate the data
– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS1/RSII
– Ion Torrent Proton/PGM
– SOLiD
http://www.acgt.me/blog/2014/3/10/next-generation-sequencing-must-diepart-2
9. 454 Pyrosequencing
One purified DNA fragment, to one bead, to one read.
http://bit.ly/1ehwxWN
GS FLX Titanium
http://bit.ly/1ehAcEh
10. Illumina
Output: 15 Gb | 120 Gb | 1000 Gb | 1800 Gb
Number of reads: 25 million | 400 million | 4 billion | 6 billion
Read length: 2x300 bp | 2x150 bp | 2x125 bp (2x250 update mid-2014) | 2x150 bp
Cost: $99K | $250K | $740K | $10M
Source: Illumina
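The output figures in the table above can be checked against the read counts and read lengths. The sketch below assumes that "number of reads" counts clusters, each sequenced from both ends (2 x read length), and that output is in gigabases. The function name and the column order are illustrative only; the slide text does not name the instruments.

```python
# Rows transcribed from the table above; columns are identified by position only.
columns = [
    # (clusters, read length in bp, stated output in Gb)
    (25e6, 300, 15),
    (400e6, 150, 120),
    (4e9, 125, 1000),
    (6e9, 150, 1800),
]

def paired_end_output_gb(clusters, read_length_bp):
    """Bases from a 2 x read_length run, in gigabases (1 Gb = 1e9 bases)."""
    return clusters * 2 * read_length_bp / 1e9

for clusters, read_length, stated in columns:
    computed = paired_end_output_gb(clusters, read_length)
    print(f"{clusters / 1e6:,.0f} million clusters, 2x{read_length} bp: "
          f"{computed:.0f} Gb computed, {stated} Gb stated")
```

Under that assumption the computed and stated outputs agree for all four columns.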
13. Pacific Biosciences SMRT sequencing
Single Molecule Real Time (SMRT) sequencing
http://bit.ly/1naxgTe
14. Pacific Biosciences SMRT sequencing
Error correction methods
• Hierarchical genome-assembly process (HGAP)
• PBJelly (English et al., PLOS ONE, 2012)
15. Pacific Biosciences SMRT sequencing
Read Lengths
7/8/2014 Centre for Agricultural Bioinformatics, Pusa 15
16. Oxford Nanopore
https://www.nanoporetech.com/
• No data yet??
• Error model
http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion
22. Real cost of Sequencing!!
Sboner et al., Genome Biology, 2011
23. Library Types
Single end
Paired end (PE, 150-800 bp, Fwd:/1, Rev:/2)
Mate pair (MP, 2 Kb to 20 Kb)
Figure: orientation of forward (F) and reverse (R) reads for each library type, with 454/Roche and Illumina shown separately.
Slide credit: Aureliano Bombarely
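The /1 (forward) and /2 (reverse) suffixes noted for paired-end libraries can be used to match mates. This is a minimal sketch, assuming read IDs end in exactly "/1" or "/2"; the function names and read IDs are hypothetical, and other naming schemes exist.

```python
def split_pair_suffix(read_id):
    """Return (template name, mate number) for IDs ending in /1 or /2."""
    if read_id.endswith("/1"):
        return read_id[:-2], 1
    if read_id.endswith("/2"):
        return read_id[:-2], 2
    return read_id, None  # single-end read or another naming scheme

def group_mates(read_ids):
    """Map each template name to its forward (1) and reverse (2) read IDs."""
    pairs = {}
    for read_id in read_ids:
        name, mate = split_pair_suffix(read_id)
        if mate is None:
            continue  # skip reads without a pair suffix
        pairs.setdefault(name, {})[mate] = read_id
    return pairs

ids = ["frag_a/1", "frag_b/1", "frag_a/2", "frag_c"]  # hypothetical read IDs
print(group_mates(ids))
```

Templates that end up with only one mate (here frag_b) are reads whose partner is missing from the input.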
24. Implications of Choice of Library
• Reads are assembled into a consensus sequence (contig).
• Pair read information links contigs into a scaffold (or supercontig); gaps are shown as NNNNN.
• Genetic information (markers) places scaffolds into a pseudomolecule (or ultracontig).
Slide credit: Aureliano Bombarely
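The contig-to-scaffold step can be illustrated with a toy example: contigs are joined in order, and each gap of estimated size is written as a run of N. This is only a sketch. The function name, sequences and gap sizes are hypothetical, and real scaffolders also have to infer order and orientation from the read pairs.

```python
def build_scaffold(contigs, gap_sizes):
    """Join contigs in order, filling each gap with the given number of Ns."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two linked contigs
        parts.append(contig)
    return "".join(parts)

contigs = ["ACGTAC", "GGCTA", "TTAGC"]  # hypothetical consensus sequences
gaps = [5, 2]                           # hypothetical gap estimates from read pairs
print(build_scaffold(contigs, gaps))
```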
25. Multiplexing Libraries
Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.
Figure: fragments from two samples, tagged AGTCGT and TGAGCA, pooled for sequencing.
Slide credit: Aureliano Bombarely
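After sequencing, pooled reads are sorted back into samples by their tags. The sketch below assumes the tag sits at the start of each read and must match exactly. The sample names, reads and function name are hypothetical; only the two tags come from the slide.

```python
def demultiplex(reads, tag_to_sample):
    """Assign each read to a sample by its leading tag and trim the tag off."""
    bins = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read in reads:
        for tag, sample in tag_to_sample.items():
            if read.startswith(tag):
                bins[sample].append(read[len(tag):])
                break
        else:
            unassigned.append(read)  # no known tag at the start of the read
    return bins, unassigned

tags = {"AGTCGT": "sample_1", "TGAGCA": "sample_2"}  # sample names are hypothetical
reads = ["AGTCGTACCGTTA", "TGAGCAGGATCCA", "AGTCGTTTGACA", "CCCCCCGATTACA"]
bins, unassigned = demultiplex(reads, tags)
print(bins)
print(unassigned)
```

Exact matching discards any read with a sequencing error in its tag. Tools used in practice often tolerate a mismatch, which this sketch does not attempt.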
26. File Formats
FASTA files:
FASTA is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
Slide credit: Aureliano Bombarely
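A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such text into a dictionary. It assumes the record ID is the first whitespace-delimited word of the header; the function name and example records are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each record ID to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}

example = """>seq1 hypothetical nucleotide record
ACGTACGT
TTGA
>pep1 hypothetical peptide record
MKVL
"""
print(parse_fasta(example))
```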
27. File Formats
FASTQ files:
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• Single-line ID with an at symbol (“@”) in the first column.
• The sequence may span multiple lines after the ID line.
• A single line with a plus symbol (“+”) in the first column introduces the quality values.
• The “+” line may repeat the ID.
• Quality values may span multiple lines after the “+” line, but their total length must equal the sequence length.
Slide credit: Aureliano Bombarely
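The FASTQ rules listed for this slide translate into a small reader. Because quality lines may themselves begin with "@", the sketch stops reading quality values once their total length matches the sequence length, rather than looking for the next "@". The function name and example records are hypothetical, and established libraries are a safer choice for real data.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from an iterable of lines."""
    it = (line.rstrip("\n") for line in lines)
    for line in it:
        if not line:
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' ID line, got: {line!r}")
        read_id = line[1:]
        seq_chunks = []
        for line in it:  # sequence lines run until the '+' line
            if line.startswith("+"):
                break
            seq_chunks.append(line)
        else:
            raise ValueError(f"record {read_id!r} has no '+' line")
        sequence = "".join(seq_chunks)
        quality = ""
        while len(quality) < len(sequence):  # quality may span several lines
            try:
                quality += next(it)
            except StopIteration:
                raise ValueError(f"record {read_id!r} is truncated")
        if len(quality) != len(sequence):
            raise ValueError(f"record {read_id!r}: quality and sequence lengths differ")
        yield read_id, sequence, quality

example = "@r1 desc\nACGT\nTTGA\n+r1\nIIII\n@@@@\n@r2\nGG\n+\n!!\n"
for record in parse_fastq(example.splitlines()):
    print(record)
```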
28. Quality control: Encoding
FASTQ files:
!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)
KLMNOPQRSTUVWXYZ[\]^_`abcdefgh Offset by 64 (Phred+64)
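Each quality character stands for an integer score: the character's ASCII code minus the offset (33 or 64). A minimal sketch follows; the function names are hypothetical. The offset guess is only a rough heuristic, since a short run of uniformly high-quality Phred+33 characters contains nothing below ASCII 64 and would be misread as Phred+64.

```python
def decode_quality(quality_string, offset):
    """Convert FASTQ quality characters to integer scores using the given offset."""
    scores = [ord(char) - offset for char in quality_string]
    if min(scores, default=0) < 0:
        raise ValueError(f"character below offset {offset}; wrong encoding?")
    return scores

def guess_offset(quality_string):
    """Rough heuristic: characters below ASCII 64 cannot occur in Phred+64."""
    lowest = min(ord(char) for char in quality_string)
    return 33 if lowest < 64 else 64

# The same score written in the two encodings:
print(decode_quality("I", 33))  # ord("I") = 73
print(decode_quality("h", 64))  # ord("h") = 104
```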
30. Quality control: Encoding
http://bit.ly/N28yUd
The Phred score of a base is:
Qphred = -10 log10(e)
where e is the estimated probability of the base being wrong.
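The Phred formula can be evaluated in both directions, from error probability to score and back. This is a small sketch with hypothetical function names.

```python
import math

def phred_from_error(e):
    """Qphred = -10 * log10(e), for an error probability 0 < e <= 1."""
    return -10 * math.log10(e)

def error_from_phred(q):
    """Inverse of the formula: e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

for e in (0.1, 0.01, 0.001):
    print(f"e = {e}: Q = {phred_from_error(e):.0f}")
for q in (10, 20, 30, 40):
    print(f"Q = {q}: e = {error_from_phred(q):g}")
```

Every 10 points of Phred score corresponds to a tenfold drop in the estimated error probability.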