Quantum algorithms for pattern matching in genomic sequences - 2018-06-22

Quantum Algorithms
for pattern-matching
in genomic sequences
22-06-2018
Aritra Sarkar
M.Sc. Thesis Project
Quantum Computer Architecture Lab, QuTech
Department of Quantum & Computer Engineering
Faculty of Electrical Engineering, Mathematics and Computer Sciences
Delft University of Technology

2
Presentation overview
● Code of life
● WGS pipeline
● What it is (not)?
● Genomical big data
● Classical approaches
● Sub-sequence index search
● Quantum accelerator
● Searching solutions
● Quantum 101
● Grover search
● Q-search (de-)motivation
● Conditional oracle
● OpenQL kernels
● Q phone directory
● Q associative memory
● Evolution
● Unitary decomposition
● Borrowing ancilla
● Testing
● IDE with circuit designer
● iBAM
● QiBAM
● Algorithm complexity
● Related applications
● Looking |𝑏𝑎𝑐𝑘⟩+|𝑎ℎ𝑒𝑎𝑑⟩
Why? What?
How Quantumly?
Existing ways
Thesis contribution
Into the future

3
Code of life
high sequence similarity usually implies significant functional or structural similarity
Genetic Similarity %
Other humans 99.9
Chimpanzees 98.6
Mouse 92
Cats 90
Cows 85
Dogs 84
Zebra-fish 73
Chicken 65
Banana 60
Honey bee 44
Grapes 24
Yeast 18
E. Coli 7
+ 97% Biological Dark Matter
Expression
Replication
Metabolism
Reproduction

5
What it is (not)?
Quantum Biology
- “if evolution is smart enough to create a creature who understands QM, it must be using it for itself”
naturally occurring QM phenomena advantages, not necessarily for computational purposes
e.g. photosynthesis, navigation in birds, neurons firing, … (sense of smell, emotions, past life, etc. … keeps getting weirder)
Quantum Genomics
• Quantum-mechanical Sequencing
• Quantum-accelerated Analysis
Sequencing
• Gen1: Sanger, ~1000 bp
• Gen2 (NGS): Illumina, Roche 454, ~100 bp, parallelism, high yield
• Gen3 (SMS): Pacific Biosciences, Oxford Nanopore, ~10000 bp
Reconstruction
• De novo / Ab initio
• (reference-based) alignment/mapping
• (reference-free) assembly
• Overlap-Layout-Consensus, pairwise alignment, de Bruijn k-mer
• Exact / Heuristic, Approximate / Optimal
Analysis
• Sorting
• Deduplication
• Variant Calling

6
2-40 EB/year
Genomical big data

7
Classical approaches
• Substring (/subsequence) matching problem
Naïve Method
O(nm), exact match, 1P vs 1T
Boyer-Moore + Improvements
Knuth-Morris-Pratt + Improvements
Suffix Trees + Improvements
O(n+m) ≈ O(n), exact match (wildcards), nP vs 1T
Needleman-Wunsch: Global Alignment
Smith-Waterman: Local Alignment
Simple Edit Transcript using memoization of Levenshtein Distance + Improvements (alphabet/operation weights)
O(nm), approximate match, 1P vs 1T, multiple solutions
BYP/CL/Myers/hybrid-dynamic methods
Alignment with arbitrary (k bounded) gaps
O(nm), approximate match, 1P vs 1T, multiple solutions
O(km), approximate match, 1P vs 1T, multiple solutions
Burrows-Wheeler-Transform + Smith-Waterman (BWT-SW): all local hits
Burrows-Wheeler-Aligner + super-Maximal Exact Match (BWA-MEM): heuristics, faster than BWT-SW

8
Sub-sequence index search
RG: Reference Genome (3 × 10^8 bp)
SR: Short Read (50 bp)
[Figure: reference genome drawn as indices 0-31 and a short read as indices 0-4; the read is shown against reference indices 21-25]

9
Quantum accelerator
…
for each short read in sample:
    do:
        find index in reference genome
        assess answer
    while (result not satisfactory)
    save short read matched index
reconstruct sequenced genome
…
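The host loop on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: `classical_find_index` is a hypothetical classical stand-in for the quantum kernel, `max_tries` is an invented retry bound, and "assess answer" is taken to mean an exact comparison of the read against the reference at the returned index.

```python
def classical_find_index(read, reference):
    """Hypothetical classical stand-in for the accelerated search kernel."""
    idx = reference.find(read)
    return idx if idx >= 0 else None


def map_read(read, reference, find_index, max_tries=3):
    """Repeat the search until the returned index verifies (the do/while loop)."""
    for _ in range(max_tries):
        idx = find_index(read, reference)          # find index in reference genome
        if idx is not None and reference[idx:idx + len(read)] == read:
            return idx                             # assess answer: satisfactory
    return None                                    # no verified match within max_tries


def map_reads(sample, reference, find_index=classical_find_index):
    """Save the matched index of every short read in the sample."""
    return {read: map_read(read, reference, find_index) for read in sample}


print(map_reads(["GAT", "CAG", "TTT"], "GTAGATCAGA"))
```

The retry loop matters for a probabilistic kernel, where a measurement can return a wrong index; with the deterministic stand-in it runs once per read.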
QASM
Simulator
multi-qubit regime target algorithm
current technologies have ~50 physical qubits
current Q Processor designs are not well scalable
exponentially difficult to simulate qubits
large planar topology yet to be implemented
full connectivity to specific topology can be compiled
number of gates related to total decoherence of result
gate fidelity guarantee with QEC codes
universal set to allow full domain exploration
Unlimited Qubits
space complexity is a critical design parameter
~50 bound for feasible QX simulation
full connectivity (complete graph)
Unlimited Gates
time complexity is a critical design parameter
Gate Fidelity = 1 (no errors)
available gates (σX/Y/Z, H, CX, CZ, Rθ, Toffoli)

10
Searching solutions
Function Evaluation: $y_s = f(x_s)$
Inductive Logic, GP, ANN, …: $y_s = f(x_s)$
Function Inversion: $y_s = f(x_s)$, i.e. $x_s = f^{-1}(y_s)$
Quantum Superposition: $y_0 = f(x_0)$, $y_1 = f(x_1)$, $y_2 = f(x_2) = y_s$, $y_3 = f(x_3)$, …
Complexity classes: P, NP, Bounded Quantum Polynomial

11
Searching solutions
21st June (yesterday)
Forrelation: in BQP, not in PH (relative to an oracle)

12
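QC 101
[Figure: atomic energy levels |0⟩, |1⟩, |2⟩ around a proton (p+)]
$|\psi\rangle = \sum_{i=0}^{2^n-1} \alpha_i |i\rangle$
Measurement: $|\psi\rangle = |i\rangle$ with $P(|i\rangle) = |\alpha_i|^2$
$\sum_{i=0}^{2^n-1} \left( (\alpha_i^{re})^2 + (\alpha_i^{im})^2 \right) = 1$
“God does not play dice”
– Albert Einstein
“Don’t tell God what to do”
– Niels Bohr
$(\gamma_0|0\rangle + \gamma_1|1\rangle) \otimes (\beta_0|0\rangle + \beta_1|1\rangle) = \gamma_0\beta_0|00\rangle + \gamma_0\beta_1|01\rangle + \gamma_1\beta_0|10\rangle + \gamma_1\beta_1|11\rangle$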

13
Grover search
• Unstructured database
• Black-box Oracle
• Quadratic reduction in query complexity: $O(\sqrt{2^n})$
• Periodic amplification
Circuit: |0⟩⊗n → H⊗n → [Oracle → Inversion about Mean], repeated $O(\sqrt{2^n})$ times → answer
Amplitude and probability for n = 3, marked state |101⟩, starting from +1|000⟩:
State   After H⊗n   Oracle 1   Inversion 1   Prob.   Oracle 2   Inversion 2   Prob.
|000⟩   +0.3536     +0.3536    +0.1768       3%      +0.1768    -0.0884       0.8%
|001⟩   +0.3536     +0.3536    +0.1768       3%      +0.1768    -0.0884       0.8%
|010⟩   +0.3536     +0.3536    +0.1768       3%      +0.1768    -0.0884       0.8%
|011⟩   +0.3536     +0.3536    +0.1768       3%      +0.1768    -0.0884       0.8%
|100⟩   +0.3536     +0.3536    +0.1768       3%      +0.1768    -0.0884       0.8%
|101⟩   +0.3536     -0.3536    +0.8839       78%     -0.8839    +0.9723       94.5%
|110⟩   +0.3536     +0.3536    +0.1768       3%      +0.1768    -0.0884       0.8%
|111⟩   +0.3536     +0.3536    +0.1768       3%      +0.1768    -0.0884       0.8%
Mean after Oracle 1: μ=+0.2652; mean after Oracle 2: μ=+0.0442
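The amplitude values on this slide can be reproduced with a small state-vector sketch. It is a plain classical simulation, not the OpenQL/QX implementation: the oracle is modelled as a sign flip on the marked amplitude and the diffusion step as a reflection of every amplitude about the mean. The function name `grover_amplitudes` is made up for illustration.

```python
import math


def grover_amplitudes(n_qubits, marked, iterations):
    """Return the amplitude vector after H^n and after each Grover iteration."""
    size = 2 ** n_qubits
    amps = [1 / math.sqrt(size)] * size        # uniform superposition after H on every qubit
    history = [list(amps)]
    for _ in range(iterations):
        amps[marked] = -amps[marked]           # oracle: phase-flip the marked state
        mean = sum(amps) / size                # mean amplitude (the μ on the slide)
        amps = [2 * mean - a for a in amps]    # inversion about the mean
        history.append(list(amps))
    return history


history = grover_amplitudes(3, 0b101, 2)
for step, amps in enumerate(history):
    print(step, [f"{a:+.4f}" for a in amps])
    print("  P(101) =", round(amps[0b101] ** 2, 3))
```

For 8 states, π/4 · √8 ≈ 2.2, so two iterations is the last step before the success probability starts to fall again.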

14
Q-search (de-)motivation
• Compile once, run many
• Oracle independent of search pattern
Circuit: |0⟩⊗n → Initialise → Oracle → A.A. → index
Labels: P; Depends on T; Grover iterations
P. Mateus, Quantum Pattern Matching, Institute Superior Técnico, Aug. 2005, arXiv:quant-ph/0508237v1, pp. 1-5.

15
Conditional oracle
[Figure: Reference drawn as indices 0-31 and Search Pattern as indices 0-4; one oracle function per alphabet symbol (Oracle Fn. A, Oracle Fn. C, Oracle Fn. G, Oracle Fn. T), each drawn over reference indices 0-31]
Oracles: Precomputed; Minimization NP-Hard
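One way to read the per-symbol oracles is as a classical emulation: each alphabet symbol gets a precomputed set of reference positions, and the oracle for the pattern's j-th symbol is queried at start index plus j. The sketch below only shows which start indices each oracle call would mark; it is not a circuit. The strings are the toy reference and read from the Q phone directory slide, and the helper names are invented for illustration.

```python
def symbol_oracles(reference, alphabet="ACGT"):
    """Precompute, per symbol, the reference positions that hold that symbol."""
    return {s: {i for i, c in enumerate(reference) if c == s} for s in alphabet}


def marked_starts(oracles, pattern, offset, n_starts):
    """Start indices i marked when the oracle for pattern[offset] is queried at i + offset."""
    return {i for i in range(n_starts) if i + offset in oracles[pattern[offset]]}


reference, pattern = "GTAGATCAGA", "GAT"
oracles = symbol_oracles(reference)            # depends only on the reference
n_starts = len(reference) - len(pattern) + 1
marks = [marked_starts(oracles, pattern, j, n_starts) for j in range(len(pattern))]
full_match = set.intersection(*marks)          # starts marked for every pattern symbol
print(marks, full_match)
```

The oracle sets depend only on the reference, which is why they can be precomputed once and reused for every search pattern.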

16
OpenQL kernels
Initialize Make/Call Oracles Inversion about mean

17
Q phone directory
• $\Sigma := \{A, C, G, T\} = \{0, 1, 2, 3\} = \{00, 01, 10, 11\}$
– $A = |\Sigma| = 4$
• $SR = GAT = 100011$
– $m = |SR| = 3$
• $RG = GTAGATCAGA = 10110010001101001000$
– $N = |RG| = 10$
• Evolve data register to Hamming distance
• Amplify zero Hamming Distance
– Fixed oracle
• Measure Tag qubits to get match index
TAG DATA   DIST
000 101100 001111
001 110010 010001
010 001000 101011
011 100011 000000
100 001101 101110
101 110100 010111
110 010010 110001
111 001000 101011
[Circuit: RG bits 10110010001101001000 loaded into the $Q_{tag}$ and $Q_{data}$ registers; X gates on $Q_{data}$]
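The TAG/DATA/DIST table can be rebuilt classically to check the encoding. This sketch assumes the DIST column is the bitwise XOR of each DATA word with the encoded read, so an all-zero DIST corresponds to zero Hamming distance. The function names are illustrative only.

```python
ENCODING = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode(seq):
    """Encode a DNA string with 2 bits per base."""
    return "".join(ENCODING[base] for base in seq)


def phone_directory(reference, read):
    """Rows of (tag, data, dist) for every start index of the read in the reference."""
    m = len(read)
    n_tags = len(reference) - m + 1
    tag_bits = max(1, (n_tags - 1).bit_length())
    pattern = encode(read)
    rows = []
    for tag in range(n_tags):
        data = encode(reference[tag:tag + m])
        dist = "".join("1" if a != b else "0" for a, b in zip(data, pattern))  # XOR
        rows.append((format(tag, f"0{tag_bits}b"), data, dist))
    return rows


rows = phone_directory("GTAGATCAGA", "GAT")
for row in rows:
    print(*row)
matches = [int(tag, 2) for tag, _, dist in rows if set(dist) == {"0"}]
print("match index:", matches)
```

The only all-zero DIST row is tag 011, so measuring the tag register after amplification would point to index 3.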

18
Q associative memory
D. Ventura et al., Quantum Associative Memory, Information Sciences 123, 2000, pp. 273-296.
Circuit: |0⟩⊗n → Store T → Recall P → P
Store: depends on T
Recall: depends on partial/approx. P
Q Machine Learning, Q Pattern Recognition, Q Searching

19
Evolution
Algorithm | Solutions | Database | Oracle | Iterations
Grover Search | one solution | full, uniform database | known Oracle for solution in database | optimal iterations
Tight bounds on quantum searching | multiple (un)known solutions | full, uniform database | known Oracle for solution in database | optimal iterations
… arbitrary initial amplitude distribution | multiple known solutions | arbitrary database | known Oracle for solution in database | optimal iterations
Quantum Pattern Matching | multiple unknown solutions | sliding index database | alphabet based Oracles | optimal iterations
… Quantum Bioinformatics | one solution | sub-string phonebook | 0 Hamming Distance Oracle | optimal iterations
Quantum Associative Memory | multiple known solutions | arbitrary database | known Oracle for solution in database | higher Pmax iteration
… associative memory with distributed queries | multiple known solutions | arbitrary database | Binomial Oracle | optimal iterations
… improved distributed queries | multiple unknown solutions | arbitrary database | Binomial Oracle | higher Pmax iteration
Gen 1 (tested): QUS
Gen 2 (tested): QPM
Gen 3 (tested): QNN
Q Unstructured Search, Q Structured Search, Q Walk / Graph Search, HSP (abelian/dihedral)

20
Unitary decomposition
• QR Decomposition: $O(n^3 4^n)$ CNOT
• Lower Bound: $\frac{1}{4}(4^n - 3n - 1)$ CNOT
• QS Decomposition (Quantum Shannon Decomposition): $\frac{23}{48} 4^n - \frac{3}{2} 2^n + \frac{4}{3}$ CNOT
• Building blocks: Cosine-Sine Decomposition, Quantum Multiplexor
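To get a feel for how these CNOT counts grow, the two closed-form expressions can be evaluated for small n. In this sketch the ceiling on the lower bound is an added assumption (gate counts are integers), and the QS expression is only evaluated for n ≥ 2, where it yields whole numbers. The function names are illustrative.

```python
import math
from fractions import Fraction


def cnot_lower_bound(n):
    """(1/4)(4^n - 3n - 1), rounded up to a whole number of CNOTs (rounding is an assumption)."""
    return math.ceil(Fraction(4 ** n - 3 * n - 1, 4))


def cnot_qsd(n):
    """(23/48) 4^n - (3/2) 2^n + 4/3, kept exact with fractions."""
    return Fraction(23, 48) * 4 ** n - Fraction(3, 2) * 2 ** n + Fraction(4, 3)


for n in range(2, 7):
    print(n, cnot_lower_bound(n), cnot_qsd(n))
```

Both expressions are dominated by the 4^n term, so an arbitrary unitary on even a modest number of qubits costs a very large number of CNOTs.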

22
Testing
1. Random String
– Chargaff's Parity rules
• %A = %T
• %C = %G
– %GC : %AT (40:60)
2. Real Data Segment
– part of HBB (hemoglobin subunit beta)
• Chromosome 11 (region p15.4) of Homo sapiens
– Sickle cell anemia
• ATG-GTG-CAT-CTG-ACT-CCT-GAG
• ATG-GTG-CAC-CTG-ACT-CCT-GTG
3. Shortest superstring (Σ,M)
– (2,2) = 00110
– (4,3) = AAATTTGTTCTTATGGTGCTGATCGTCCTCATAGTACTAAGGGCGGAGCCGCAGACGAACCCACAA
– (4,2) = AATTGTCTAGGCGACCA
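The three test strings in item 3 can be checked for the superstring property: each should contain every length-M word over its alphabet, and a shortest such string has length |Σ|^M + M − 1. The helper below is an illustrative check, not the generator used in the thesis.

```python
from itertools import product


def covers_all_kmers(s, alphabet, k):
    """True if every length-k word over the alphabet occurs in s."""
    present = {s[i:i + k] for i in range(len(s) - k + 1)}
    return present == {"".join(p) for p in product(alphabet, repeat=k)}


cases = [
    ("00110", "01", 2),
    ("AATTGTCTAGGCGACCA", "ACGT", 2),
    ("AAATTTGTTCTTATGGTGCTGATCGTCCTCATAGTACTAAGGGCGGAGCCGCAGACGAACCCACAA", "ACGT", 3),
]
for s, alphabet, k in cases:
    shortest = len(alphabet) ** k + k - 1
    print(k, len(s), shortest, covers_all_kmers(s, alphabet, k))
```

Such strings make compact test references because every possible query of length M has exactly one exact match.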

24
iBAM
• (Quantum) indexed-bidirectional associative memory
– Content Addressable Storage (CAS) + RAM
– (Q)BAM is a type of (Q) Neural Network
[Circuit: $Q_{tag}$ register with a Min. Hamming dist. Oracle]

25
QiBAM
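Flow: |0⟩⊗n → Initialise QPD (depends on T) → Evolve to Hamming distances (depends on P) → amplitude amplification (A.A.) with optimal iterations → index
A.A. stages: Distributed Query around 0, Mark all Memories, Distributed Query around 0
$|b_x^p|^2 = q^{|p-x|} (1-q)^{d-|p-x|}$, where $|p-x|$ is the Hamming distance between $p$ and $x$
$O_b = 1 - 2|b^p\rangle\langle b^p|$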
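The distributed query weights each basis state x by q^h (1−q)^(d−h), where h is the Hamming distance from the query centre p and d is the register width. The sketch below tabulates these weights classically and checks that they sum to one. The centre "000000" (a query "around 0") and q = 0.25 are arbitrary illustrative choices, and the function names are invented.

```python
from itertools import product


def hamming(a, b):
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def distributed_query_weights(p, q):
    """Squared amplitudes q^h (1-q)^(d-h) for every d-bit string x, with h = hamming(x, p)."""
    d = len(p)
    weights = {}
    for bits in product("01", repeat=d):
        x = "".join(bits)
        h = hamming(x, p)
        weights[x] = q ** h * (1 - q) ** (d - h)
    return weights


weights = distributed_query_weights("000000", 0.25)
print("total:", round(sum(weights.values()), 12))
print("centre:", weights["000000"], "one bit off:", weights["000001"])
```

For q below 1/2 the weight falls off with Hamming distance, so states closest to the centre are favoured; this is what lets the query tolerate approximate matches.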

26
QiBAM: an example
• Reference: AATTGTCTAGGCGACC
• Query: CA

27
Algorithm complexity
Quantum Accelerator
Quantum Algorithm, Quantum State Cloning, Quantum State Tomography
Classical Pre-processing
Classical Post-processing
Hybrid Compilation
Classical Program
Classical Processing
Mapping, Error Correction
O( f(experimental) × g(no-cloning) × h(algorithm) )

28
Related applications
DNA Fingerprinting, Motif Finding, Amino-acid Sequencing
Pattern based Trading, Object Recognition, Speech Recognition
18x18 px
17 qubits
~ 50k gates
Exact matching

29
Looking |𝑏𝑎𝑐𝑘⟩ + |𝑎ℎ𝑒𝑎𝑑⟩
• get index of search pattern (/short read) in reference string (/genome)
– Solved: use Quantum Phone Directory encoding for storage
• synthesizing Oracle circuit without knowing answer
– Solved: 0 Hamming distance
• approximate optimal matching
– Solved: use distributed query
• RG and SR dependencies well segregated
– Solved: store/recall mechanism of associative memory
• algorithm scaling w.r.t. alphabet, reference and read size
– Solved: scales w.r.t. $\log_2(Alphabet) \times Read + \log_2(Reference - Read + 1)$
– 130 logical full-connectivity qubits for human DNA with 50bp Illumina reads
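The scaling expression can be turned into a small qubit counter. This sketch rounds each logarithm up to a whole number of qubits, which is an assumption, and it counts only the data and tag registers, not any ancilla. The slide quotes 130 for the human genome; the exact figure from this sketch depends on the reference length and rounding assumed.

```python
def qubit_count(alphabet_size, read_len, ref_len):
    """Data qubits (bits per symbol x read length) plus tag qubits (bits to index each start)."""
    bits_per_symbol = max(1, (alphabet_size - 1).bit_length())    # ceil(log2(alphabet))
    tag_bits = max(1, (ref_len - read_len).bit_length())          # ceil(log2(ref - read + 1))
    return bits_per_symbol * read_len + tag_bits


# Toy example from the Q phone directory slide: GAT in GTAGATCAGA.
print(qubit_count(4, 3, 10))
# 50 bp reads against a reference of 3 x 10^8 bp.
print(qubit_count(4, 50, 3 * 10 ** 8))
```

For the toy example this gives 6 data qubits plus 3 tag qubits, matching the 6-bit DATA and 3-bit TAG columns of the phone directory table.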
Ahead:
• Multi-pattern matching
• Gaps & Repeats
• Quality Value of reads
• Cloning & Tomography
• Quantum graph search
• Exponential speedup (QFT based)
• System design ASQC
• Other domains


31
Milestones
• Nov: study genome-seq. analysis pipeline & identify computation intensive kernels
• Dec: study quantum pattern matching algorithms & implement papers in MATLAB
• Jan: study quantum pattern matching algorithms & implement in OpenQL (+QuInE)
• Feb: converge on a scheme & explore edge cases, limitations and extensions
• Mar: prove (/implement) extensions of scheme
• Apr: testing algorithm performance & find quantum supremacy problem size + report
• May: explore other use cases of scheme + report
• Jun: presentation and publications + report

32
QC 101
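Bloch sphere axes:
$Z = |0\rangle$, $-Z = |1\rangle$
$X = \frac{|0\rangle + |1\rangle}{\sqrt{2}} = |+\rangle$, $-X = \frac{|0\rangle - |1\rangle}{\sqrt{2}} = |-\rangle$
$Y = \frac{|0\rangle + i|1\rangle}{\sqrt{2}} = |i\rangle$, $-Y = \frac{|0\rangle - i|1\rangle}{\sqrt{2}} = |-i\rangle$
$|\psi\rangle$
$(\gamma_0|0\rangle + \gamma_1|1\rangle) \otimes (\beta_0|0\rangle + \beta_1|1\rangle) = \gamma_0\beta_0|00\rangle + \gamma_0\beta_1|01\rangle + \gamma_1\beta_0|10\rangle + \gamma_1\beta_1|11\rangle$
$\alpha_{00}|00\rangle + \alpha_{11}|11\rangle = \; ? \otimes ?$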

33
Quantum kernels
• Arbitrary Boolean function
• If state of an index to be marked
– Take Boolean value of Index
– Apply CPhase on all s’ qubits
– X Control on qubits with value = 0
• 111000000
• Sequential copy and increment
• Grover Gate on all s’*M qubits
• Inversion about Mean
• Amplitude Amplification
Initialize Make/Call Oracles Inversion about mean

34
DNA strings
• 4 Oracles: A, T, G, C
• Qubit complexity linear in Alphabet size
• Oracles will (typically) be less complex for higher alphabet sizes
– Mark ~1/4th states instead of ~1/2
• Algorithm is more robust for higher Alphabet sizes
– Less possibility of 1 character dominating > 50% of the string
• Finding “13” in “22013230”
– 10 qubits
– 37 h, 38 x, 9 cnot, 33 toffoli gates

35
Results
• States
– (−0.366457, +0.000000) |0000000110⟩
– (+0.349626, +0.000000) next largest states
• Circuit
– 10 qubits
– 69 H + 117 C0X + 280 C2X
Hollenberg, L.C., 2000. Fast quantum search algorithms in protein sequence comparisons: Quantum bioinformatics. Physical Review E, 62(5), p. 7532.
Flow: |0⟩⊗n → Initialise QPD (depends on T) → Evolve to Hamming distances (depends on P) → Oracle → A.A. (Grover iterations) → index
