Nanopore Sequencing
1. Nanopore Sequencing
ADAM University of Kyrgyz Republic
Faculty of Medicine
Student: Tashfeen Ahmad
Group: GM_4
Teacher: Prof. Domashov Iliya
2. Nanopore Sequencing
Outline
• Nanopore Sequencing Technology
• Raw Data
• Transformations and Raw Data Processing
• Toward Producing a Basecaller
• Future Directions
4. What is Nanopore Sequencing?
Oxford Nanopore became the first company to provide a commercially available nanopore sequencer in 2015 (available to the community in 2012)
5. What is Nanopore Sequencing?
Nanopore is a disruptive technology:
• Sequencer Size
• Read Length
• Potential direct RNA sequencing
• Biology Problem with Data Velocity Issues
• Currently ~400 GB per 24 hours needs to be processed
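To put the data-velocity bullet in concrete terms, the stated ~400 GB per 24 hours can be converted to a sustained rate. This is a minimal sketch, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the function name is illustrative and not part of any package.

```python
def sustained_rate_mb_per_s(gigabytes: float, hours: float) -> float:
    """Average throughput in MB/s, assuming 1 GB = 1e9 bytes and 1 MB = 1e6 bytes."""
    seconds = hours * 3600.0
    return gigabytes * 1e9 / seconds / 1e6


# ~400 GB in 24 hours, as quoted on the slide
rate = sustained_rate_mb_per_s(400, 24)
print(f"{rate:.2f} MB/s")  # about 4.63 MB/s
```

Any processing pipeline has to sustain at least this average rate just to avoid falling behind the sequencer.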
8. Processing Raw Data
• First step is to create a training data set
• Start from the provided raw data, then process it into a data set useful for training a model to predict genomic bases
• Goal is to release this package to the community, giving wider access to tools for creating training data sets from this data
14. Nanopore Raw Correction
Correcting insertions:
1. Center on insertion
2. Expand to neighboring regions
3. Segment using mean changepoint
[Figure: raw signal traces annotated with per-base labels (A, C, G, T, D), illustrating insertion correction]
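Step 3 (segmenting by mean changepoint) can be sketched as follows. This is a generic binary-segmentation approach, not necessarily the method used in the package described here: it repeatedly picks the split that most reduces the within-segment sum of squared errors. The names `best_split` and `segment`, and the parameters `min_gain` and `min_len`, are hypothetical; the input is assumed to be the window of raw signal already selected in steps 1 and 2.

```python
from statistics import fmean


def sse(xs):
    """Sum of squared deviations from the segment mean."""
    if not xs:
        return 0.0
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs)


def best_split(xs, min_len=2):
    """Return (index, gain) of the split that most reduces SSE, or (None, 0.0)."""
    total = sse(xs)
    best_i, best_gain = None, 0.0
    for i in range(min_len, len(xs) - min_len + 1):
        gain = total - sse(xs[:i]) - sse(xs[i:])
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i, best_gain


def segment(xs, min_gain, min_len=2, offset=0):
    """Recursively split xs at mean changepoints; return sorted indices."""
    i, gain = best_split(xs, min_len)
    if i is None or gain < min_gain:
        return []
    left = segment(xs[:i], min_gain, min_len, offset)
    right = segment(xs[i:], min_gain, min_len, offset + i)
    return left + [offset + i] + right


# Toy window with three signal levels (illustrative values, not real data)
signal = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0, 2.0, 2.1, 1.9, 2.0]
changepoints = segment(signal, min_gain=1.0)
print(changepoints)
```

The `min_gain` threshold controls how large a shift in mean must be before it counts as a new segment; on noisy data it would need tuning.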
16. Raw Nanopore Data
• Noise level is quite high (improvements in the underlying technology are hoped for)
• Shown above is the same DNA sequence observed 8 times
17. Toward a Basecaller
Post-correction and post-normalization distributions
• Clearly some signal exists before complex machine learning
• ~13% accuracy achievable by nearest-mean calculations
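A nearest-mean baseline of the kind mentioned above might look like the sketch below. It assumes each segment has already been corrected and normalized to a single level, and that labels are k-mer strings; the function names, the example k-mers, and the signal values are all illustrative, not taken from the talk.

```python
from statistics import fmean


def fit_class_means(levels, labels):
    """Mean normalized signal level for each label."""
    groups = {}
    for level, label in zip(levels, labels):
        groups.setdefault(label, []).append(level)
    return {label: fmean(values) for label, values in groups.items()}


def predict(level, class_means):
    """Assign the label whose mean level is closest to the observed level."""
    return min(class_means, key=lambda label: abs(level - class_means[label]))


# Toy training set (hypothetical k-mers and levels)
train_levels = [0.9, 1.1, 3.0, 3.2, 5.1, 4.9]
train_labels = ["AAAAA", "AAAAA", "ACGTA", "ACGTA", "GGGTC", "GGGTC"]

means = fit_class_means(train_levels, train_labels)
calls = [predict(level, means) for level in [1.3, 2.8, 4.6]]
print(calls)
```

With real data the class distributions overlap heavily, which is why a baseline like this reaches only modest accuracy while still showing that the signal carries information.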
18. Toward a Basecaller
• Oxford Nanopore has recently upgraded to an RNN basecaller which produces reads with ~85% accuracy, though it is still computationally intensive
• Larger sequencer (PromethION) produces 12 Tb of data in 48 hours (up to 1.44 GB/s), with the current machine requiring ~1 kW
19. Toward a Basecaller
Current event (base) segmentation is done using an FPGA t-test, and all computation (RNN) is completed on the mean and SD of these segments.
We are currently working to integrate basecalling and segmentation directly from the raw data via an RNN, with potentially vast improvements in accuracy as well as speed, which will become increasingly important with throughput improvements.
[Figure: ROC curve (TPRate vs. FPRate) with a log10 FDR scale]
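The t-test segmentation described on this slide can be illustrated in software. The sketch below is not Oxford Nanopore's FPGA implementation: it slides two adjacent windows along the signal, computes Welch's t-statistic between them, and places a boundary where the statistic peaks above a threshold. It then reports the mean and SD of each segment, the features the slide says are passed to the RNN. The `window` and `threshold` parameters and all function names are assumptions for illustration.

```python
from math import sqrt
from statistics import fmean, pstdev, variance


def welch_t(a, b):
    """Welch's t-statistic between two samples (each needs >= 2 points)."""
    denom = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if denom == 0.0:
        return float("inf") if fmean(a) != fmean(b) else 0.0
    return (fmean(a) - fmean(b)) / denom


def find_boundaries(signal, window, threshold):
    """Indices where |t| between left and right windows is a local peak >= threshold."""
    scores = {
        i: abs(welch_t(signal[i - window:i], signal[i:i + window]))
        for i in range(window, len(signal) - window + 1)
    }
    boundaries = []
    for i, score in scores.items():
        neighbors = [scores.get(j, 0.0)
                     for j in range(i - window + 1, i + window) if j != i]
        if score >= threshold and all(score > n for n in neighbors):
            boundaries.append(i)
    return boundaries


def segment_stats(signal, boundaries):
    """(mean, SD) for each segment delimited by the boundaries."""
    edges = [0] + list(boundaries) + [len(signal)]
    return [(fmean(signal[a:b]), pstdev(signal[a:b]))
            for a, b in zip(edges, edges[1:])]


# Toy signal with one level change (illustrative values, not real data)
signal = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
boundaries = find_boundaries(signal, window=3, threshold=5.0)
stats = segment_stats(signal, boundaries)
print(boundaries, stats)
```

Reducing each segment to a mean and SD discards the shape of the raw trace, which is one motivation for the integrated raw-signal approach described above.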
20. Challenges
• Data Velocity
• Basecaller must be able to keep up with the increasing speed of the data
• Accuracy
• Basecaller must be accurate enough to provide meaningful biological insight
• Adaptability
• Would like to be able to interrogate the data in order to assess confidence, as well as possible alterations outside of the given model
21. Future Directions
• Produce 1D basecalls on par with current algorithms (~70-80% accuracy)
• Exploring architectures and pre-processing
• Investigate base alterations (methylation, acetylation, etc.) via encoding layers
• Release package to create raw data training sets and provide QC metrics for raw reads