A review of two alignment-free methods for sequence comparison. In this presentation two alignment-free methods are studied:
- "Similarity analysis of DNA sequences based on LZ complexity and dynamic programming algorithm" by Guo et al.
- "Alignment-free comparison of genome sequences by a new numerical characterization" by Huang et al.
2. Outline
• Introduction to sequence alignment problem
• Introduction to alignment-free sequence comparison
• An LZ-complexity based alignment-free method
• A 2D graphical alignment-free method
• Methods overall comparison
3. Introduction to sequence alignment
• Goal: determine whether a given sequence is similar to another sequence
• Determine whether a database contains a potentially homologous sequence.
4. Introduction to sequence alignment
• Two alignment types are used: global and local
• The global approach compares one whole sequence with other entire sequences.
• The local method takes a subset of a sequence and attempts to align it to a subset of other sequences.
5. Introduction to sequence alignment
• A global alignment compares the two sequences over their entire length.
GCATTACTAATATATTAGTAAATCAGAGTAGTA
||||||||| ||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG
6. Introduction to sequence alignment
• By contrast, when a local alignment is performed, a small seed is
uncovered that can be used to quickly extend the alignment.
• The initial seed for the alignment:
TAT
|||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG
• And now the extended alignment:
TATATATTAGTA
||||||||| ||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG
8. Introduction to sequence alignment
• How to search for similarities in genetic sequences?
• Naive methods: comparing all possible alignments (extremely slow)
• Heuristic methods
• Examples: BLAST, FASTA, …
• Optimal solution is not guaranteed
• Tradeoff: Speed vs Accuracy
• Dynamic programming methods
• Examples: Needleman & Wunsch, Smith & Waterman
• …faster alternatives?
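To make the cost of the dynamic programming methods concrete, here is a minimal sketch of the Needleman & Wunsch global alignment score. The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, not taken from the slides. The table has (len(a)+1) × (len(b)+1) cells, so time and memory grow with the product of the two lengths.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b (scoring values are illustrative)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]
```

For example, `needleman_wunsch_score("ACGT", "AGT")` aligns three matching symbols and pays for one gap.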
10. Alignment-free comparison
• Challenge: overcome the inefficiency of traditional alignment-based algorithms
• Alignment-based methods
• Slow
• May produce incorrect results when used on more divergent but functionally
related sequences
11. Alignment-free comparison
• Much faster than alignment-based methods
• most methods work in linear time
• Four categories:
• methods based on k-mer/word frequency,
• methods based on substrings,
• methods based on information theory (LZ-complexity based method) and
• methods based on graphical representation (2D-graphical method)
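As an illustration of the first category (k-mer/word frequency), the sketch below builds a normalized k-mer frequency vector in a single pass over the sequence. The function name, the default k = 2 and the alphabet ordering are assumptions made for this example; the two reviewed papers do not use this code.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2, alphabet="ACGT"):
    """Relative frequency of every k-mer over the alphabet, in lexicographic order."""
    windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Counter returns 0 for k-mers that never occur in seq.
    return [counts["".join(p)] / windows for p in product(alphabet, repeat=k)]
```

Two sequences of different lengths map to vectors of the same dimension (4^k for DNA), which can then be compared with any vector distance.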
12. LZ-complexity based sequence comparison
• Method based on information theory
• Analysis of DNA/protein sequences
• Built upon the LZ-complexity measure
• Dynamic programming algorithm
13. LZ-complexity
• Complexity measure for finite sequences
• LZ-complexity as entropy rate estimator for finite sequences
• Produces a dictionary of productions for a sequence 𝑆.
• “The proposed complexity measure is related to the number of steps in a self-delimiting production process by which a given sequence is presumed to be generated” (Abraham Lempel and Jacob Ziv, “On the Complexity of Finite Sequences”, 1976)
14. LZ-complexity (production process)
• $m$-step production process of a finite sequence $S$:
$H(S) = S(1, h_1) \cdot S(h_1 + 1, h_2) \cdots S(h_{m-1} + 1, h_m)$
• $H(S)$ is called the history of $S$, and $H_i(S) = S(h_{i-1} + 1, h_i)$ (with $h_0 = 0$) is called the $i$-th component of $H(S)$.
• Each component $H_i(S)$ is added to a dictionary
15. LZ-complexity (algorithm)
Initialize the dictionary
Repeat until the whole sequence has been consumed:
  Add the next symbol to the current subsequence.
  If the subsequence is not reproducible from the previous history (or the sequence has ended), add it to the dictionary, increase the index value and start a new subsequence
The production process inserts a comma (',') into a sequence 𝑆 after the
creation of each new phrase formed by the concatenation of the longest
recognized dictionary phrase and the innovative symbol that follows.
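The production process can be sketched in a few lines of Python. This is a minimal, quadratic-time illustration that assumes the Lempel-Ziv (1976) notion of reproducibility: the current subsequence is reproducible if it occurs in the text that precedes its last symbol. The function names are hypothetical.

```python
def lz_history(s):
    """Split s into the components of its exhaustive history."""
    components = []
    n = len(s)
    start = 0  # index where the current component begins
    while start < n:
        end = start + 1  # the current candidate is s[start:end]
        # Keep extending while the candidate can be copied from the text
        # that precedes its last symbol (overlapping copies are allowed).
        while end <= n and s[start:end] in s[:end - 1]:
            end += 1
        end = min(end, n)  # the last component may end with the sequence
        components.append(s[start:end])
        start = end
    return components


def lz_complexity(s):
    """c(S): the number of components in the history of s."""
    return len(lz_history(s))
```

Calling `lz_history("ATGGTCGGTTTC")` yields the components A, T, G, GT, C, GGTT, TC, so the complexity of that sequence is 7.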
26. LZ-complexity (Example)
• S = ATGGTCGGTTTC
• The complexity $c(S)$ of the sequence $S$ is $c(S) = 7$
• The history of $S$ is $H(S) = \{A, T, G, GT, C, GGTT, TC\}$
Position  Symbol  Add to dictionary  Index
1         A       A                  1
2         T       T                  2
3         G       G                  3
4         G
5         T       GT                 4
6         C       C                  5
7         G
8         G
9         T
10        T       GGTT               6
11        T
12        C       TC                 7
27. LZ-complexity based sequence comparison
• Based on the number of components in the LZ-complexity decomposition of the DNA sequences.
• Given two sequences $S$ and $Q$ decomposed using the LZ-complexity:
$S = S_1 S_2 \ldots S_k \ldots S_m$
$Q = Q_1 Q_2 \ldots Q_k \ldots Q_n$
where $m$ is the number of fragments of $S$ and $n$ is the number of fragments of $Q$.
28. LZ-complexity based sequence comparison
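• Let $\sigma$ be the score function used to build the dynamic programming matrix. It is defined as follows:
$\sigma(S_i, \_) = \sigma(\_, Q_j) = 1$
$\sigma(S_i, Q_j) = 1 - \dfrac{N(S_i, Q_j)}{\max(\mathrm{length}(S_i), \mathrm{length}(Q_j))}$
• where $N(S_i, Q_j)$ is the number of elements that fragments $S_i$ and $Q_j$ have in common, and $\_$ denotes a gap.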
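A possible reading of the score function in code is shown below. It rests on two assumptions that the slide does not state explicitly: $N$ counts shared symbols with multiplicity regardless of position (a multiset intersection), and a gap is represented by an empty string. The function names are hypothetical.

```python
from collections import Counter


def shared_symbols(x, y):
    """N(x, y): symbols common to both fragments, counted with multiplicity (assumed)."""
    return sum((Counter(x) & Counter(y)).values())


def sigma(x, y):
    """Score of pairing fragment x with fragment y; an empty string stands for a gap."""
    if not x or not y:
        return 1.0
    return 1.0 - shared_symbols(x, y) / max(len(x), len(y))
```

Under this reading, identical fragments score 0, fragments with no symbol in common score 1, and `sigma("GT", "TGA")` is 1 - 2/3.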
30. Example
S↓ \ Q→         A     T     G     TGA    ATGC   AT
          0     1     2     4     8      16     32
A         1     0     1     2     3      4      5
T         2     1     0     1     2      3      4
G         4     2     1     0     1      2      3
GT        8     3     2     1     0.333  1.333  2.333
C         16    4     3     2     1.333  1.083  2.083
GGTT      32    5     4     3     2.333  1.833  1.833
TC        64    6     5     4     3.333  2.833  2.333
31. Example
• $M[m, n]$, the bottom-right entry of the matrix, is the similarity distance between sequences $S$ and $Q$; in this example $M[7, 6] = 2.333$.
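The matrix in this example can be reproduced with the sketch below. Several points are assumptions: the recurrence (minimum over the diagonal, upper and left neighbours, with a gap cost of 1) is the usual dynamic programming scheme and is not spelled out on the slides; the border values 0, 1, 2, 4, 8, … are read off the first row and column of the table; and $N$ is taken as a multiset intersection of symbols. All names are hypothetical.

```python
from collections import Counter


def sigma(x, y):
    """1 - N(x, y) / max length, with N assumed to be a multiset intersection."""
    shared = sum((Counter(x) & Counter(y)).values())
    return 1.0 - shared / max(len(x), len(y))


def lz_distance_matrix(s_parts, q_parts, gap=1.0):
    """Fill the dynamic programming matrix over two lists of LZ components."""
    m, n = len(s_parts), len(q_parts)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    # Border values copied from the example table: 1, 2, 4, 8, ...
    for i in range(1, m + 1):
        M[i][0] = float(2 ** (i - 1))
    for j in range(1, n + 1):
        M[0][j] = float(2 ** (j - 1))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(
                M[i - 1][j - 1] + sigma(s_parts[i - 1], q_parts[j - 1]),
                M[i - 1][j] + gap,   # component of S paired with a gap
                M[i][j - 1] + gap,   # component of Q paired with a gap
            )
    return M


S_PARTS = ["A", "T", "G", "GT", "C", "GGTT", "TC"]
Q_PARTS = ["A", "T", "G", "TGA", "ATGC", "AT"]
matrix = lz_distance_matrix(S_PARTS, Q_PARTS)
distance = matrix[len(S_PARTS)][len(Q_PARTS)]  # bottom-right entry M[m, n]
```

With these assumptions the bottom-right entry matches the 2.333 shown in the table.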
32. Results
• Data set: sequences of the first exon of the β-globin gene of 11 species
• Method:
Calculate the similarity degree between each pair of sequences using the proposed method (LZ-complexity + dynamic programming)
Arrange all the similarity degrees into a matrix
Feed the pairwise distances into a neighbor-joining program in the PHYLIP package
35. G. Huang et al. (2D-graphical method)
• Method based on graphical representation
• Four vectors correspond to four groups of nucleotides:
𝐴 → (1, −
3
3)
𝑇 → (1,
3
2)
𝐺 → (1, − 5)
𝐶 → (1, 3)
36. G. Huang et al. (2D-graphical method)
• A DNA sequence can be turned into a graphical curve
37. G. Huang et al. (2D-graphical method)
• The graphs intuitively show the (dis)similarity between sequences.
39. G. Huang et al. (2D-graphical method)
• How to compare sequences?
• Similarity among sequences can be quantified by computing distance
between either vectors or points.
• Spatial distances
• Euclidean distance
• Mahalanobis distance
• Standard Euclidean distance
• Cosine similarity
• Stuart et al. (2002)
40. Euclidean distance
• Given two vectors $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$, the Euclidean distance is computed as follows:
$ED(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
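A direct transcription of the Euclidean distance, as a minimal sketch using only the standard library (the function name is an assumption):

```python
import math


def euclidean_distance(a, b):
    """ED(A, B) for two numeric vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, the vectors (0, 0) and (3, 4) are at distance 5.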
41. Mahalanobis distance
• The Mahalanobis distance takes the covariance of the data into account. It is defined as follows:
$MD(A, B) = \sqrt{(A - B)\, CV^{-1}\, (A - B)'}$
• $CV$ is the covariance matrix
43. Cosine similarity
• Stuart et al. define a distance using the angle between vectors. It is defined as follows:
$AD(A, B) = \dfrac{A \cdot B}{|A| \times |B|} = \dfrac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \times \sqrt{\sum_{i=1}^{n} b_i^2}}$
$EAD(A, B) = -\ln\left(\dfrac{1 + AD(A, B)}{2}\right)$
• where $AD(A, B)$ is the cosine similarity between $A$ and $B$, and $EAD(A, B)$ represents the evolutionary distance between $A$ and $B$.
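The two formulas translate directly into code. The sketch below uses only the standard library; the function names are assumptions, and non-zero vectors are assumed.

```python
import math


def cosine_similarity(a, b):
    """AD(A, B): cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def evolutionary_distance(a, b):
    """EAD(A, B) = -ln((1 + AD(A, B)) / 2)."""
    return -math.log((1.0 + cosine_similarity(a, b)) / 2.0)
```

Parallel vectors give a distance close to 0 and orthogonal vectors give ln 2. The distance diverges as the vectors approach opposite directions, where the argument of the logarithm reaches 0.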
44. Results
• Two data sets were used
• a set of real sequences
• Human mitochondrial genome
• a set of random sequences
• Obtained by applying random mutations to the real sequences
(1%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% mutation rates)
• Euclidean, standard Euclidean (SED), Mahalanobis and EAD distances were used
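One way to generate such a mutated set is sketched below. The mutation model is an assumption, since the slides do not describe it: each position is substituted independently with probability equal to the mutation rate, and always by a different nucleotide. The function name and the optional `rng` parameter are hypothetical.

```python
import random


def mutate(seq, rate, rng=None, alphabet="ACGT"):
    """Return a copy of seq with point substitutions applied at the given rate."""
    rng = rng or random.Random()
    mutated = []
    for base in seq:
        if rng.random() < rate:
            # Substitute with a different symbol so the position really changes.
            mutated.append(rng.choice([c for c in alphabet if c != base]))
        else:
            mutated.append(base)
    return "".join(mutated)
```

Passing a seeded `random.Random` makes the mutated set reproducible; a rate of 0 returns the sequence unchanged and a rate of 1 changes every position.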
45. Results
• $d_x$ denotes the distance between a sequence and its randomly mutated version.
• The Euclidean distance is more sensitive to the mutation rate than the other three distances.
46. Results
• 35 mitochondrial genome sequences from different mammals (GenBank database)
• Primate species, including human, ape, gorilla, chimpanzee, etc., are grouped together
• The result agrees with those obtained by Yu et al. (2010) and Raina et al. (2005)
48. Presented methods comparison
LZ-complexity based algorithm         2D-graphic based algorithm
Dynamic programming algorithm         Graphical algorithm
LZ-complexity measure                 Various distances (ED, Mahalanobis, …)
Generic (DNA/proteins)                DNA-specific
Unrooted phylogenetic-tree results    Rooted phylogenetic-tree results
50. Presented methods comparison
• LZ-complexity measure (excerpt of the production table):
Position  Symbol  Add to dictionary  Index  Rate
1         A       A                  1      1
2         T       T                  2      1
3         G       G                  3      1
4         G
5         T       GT                 4      0.80
..        ..      ..                 ..     ..
• Various distances, e.g. the Euclidean distance:
$ED(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
Editor's notes
PHYLIP is a free package of programs for inferring phylogenies