1. .
Algorithm of Suffix Tree
by farseerfc@gmail.com
.
Jiachen Yang1
1 Research Student in Osaka University
November 12, 2011
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 1 / 25
2. Part Outline
.
1 What is Suffix Tree
.
2 History & Naïve Algorithm
.
3 Optimization of Naïve Algorithm
.
4 Examples & Analysis
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 2 / 25
3. What can we do with suffix tree?
.
Find properties of the strings
Linear algorithms for exact string . Find the longest common substrings of the
1
matching
string Si and Sj in θ (ni + nj ) time .
like KMP . Find all maximal pairs, maximal repeats or
2
Search for strings supermaximal repeats in θ (n + z) time, if there
. Check if a string P of length m is
1 are z such repeats .
. Find the Lempel-Ziv decomposition in θ (n)
3
a substring in O(m) time.
. Find all z occurrences of the
2 time .
. Find the longest repeated substrings in θ (n)
patterns P1 , · · · , Pq of total 4
length m as substrings in time.
. Find the most frequently occurring substrings
5
O(m + z) time.
. Search for a regular expression P
3 of a minimum length in θ (n) time.
. Find the shortest strings from ∑ that do not
6
in time expected sublinear in n .
. Find for each suffix of a pattern
4 occur in D, in O(n + z) time, if there are z such
P, the length of the longest strings.
. Find the shortest substrings occurring only
7
match between a prefix of
P[i . . . m] and a substring in D in once in θ (n) time.
. Find, for each i, the shortest substrings of Si
θ (m) time . 8
. ...
5 not occurring elsewhere in D in θ (n) time.
. ...
9
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 3 / 25
4. Trie, Radix Tree, Suffix Trie & Suffix Tree
trie1 is a dictionary tree.
(AKA prefix tree)
stores a set of words.
each node represents a character except that root is empty string.
words with common prefix share same parent nodes.
minimal deterministic finite automaton that accepts all words.
radix tree is a trie with compressed chain of nodes.
(AKA patricia trie or radix trie)
Each internal node has at least 2 children.
suffix trie is a trie which stores all suffix of a given string.
suffix tree is a suffix radix tree.
that enables linear time construction and fast
algorithms of other problems on a string.
1 pronounced as in word retrieval by its inventor, /tri:/ “tree”, but
pronounced /traI/ “try” by other authors . . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 4 / 25
5. Trie & Radix Tree
.
Trie of “A to tea ted ten i
in inn”
.
.
Radix tree example
.
.
.
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 5 / 25
6. Suffix Trie & Suffix Tree of “banana”
^
b a n $
0:6
a n $ a banana$ a na $
1:0 8:6 6:6 10:6
n a n $
na $ na$ $
a n $ a
4:6 9:6 3:2 7:6
n a $
na$ $
a $
2:1 5:6
$
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 6 / 25
7. Suffix Tree of “mississippi”
mississippi$
0:11
(1,2) (0,...) (8,9) (11,...) (2,3)
i mississi... p $... s
12:8 1:0 15:10 18:11 4:4
(8,...) (2,5) (11,...) (10,...) (9,...)
ppi$... ssi $... i$... pi$...
(3,5) (4,5)
13:8 6:8 17:11 16:10 14:8
si i
(8,...) (5,...)
ppi$... ssippi$...
7:8 2:1 8:8 10:8
(8,...) (5,...) (8,...) (5,...)
ppi$... ssippi$... ppi$... ssippi$...
9:8 3:2 11:8 5:4
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 7 / 25
8. Part Outline
.
1 What is Suffix Tree
.
2 History & Naïve Algorithm
.
3 Optimization of Naïve Algorithm
.
4 Examples & Analysis
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 8 / 25
9. History of Suffix Tree Algorithms
.
M. Farach. Optimal suffix tree
First linear algorithm was introduced by Weiner construction with large
alphabets. In focs, page
1973 as position tree. Awarded by Donald Knuth 137. Published by the IEEE
Computer Society, 1997.
as “Algorithm of the year 1973”. E. M. McCreight. A
space-economical suffix
Greatly simplified by McCreight 1976. tree construction algorithm.
J. ACM, 23:262–272, April
Above two algorithms are processing string backward. 1976. ISSN 0004-5411.
doi: http://doi.acm.org/10.
1145/321941.321946. URL
First online construction by Ukkonen 1995, which http:
//doi.acm.org/10.
is easier to understand. 1145/321941.321946.
E. Ukkonen. On-line
construction of suffix trees.
Above algorithms assume size of alphabet as fixed Algorithmica, 14:249–260,
1995. ISSN 0178-4617.
constant . URL
http://dx.doi.org/10.
Limitation was break by Farach 1997, optimal for 1007/BF01206331.
10.1007/BF01206331.
all alphabets. P. Weiner. Linear pattern
matching algorithms. In
Switching and Automata
Further study are continued to scale to scenarios when Theory, 1973. SWAT ’08.
IEEE Conference Record of
the whole suffix tree or even input string cannot fit into 14th Annual Symposium on,
pages 1 –11, oct. 1973. doi:
memory. . . .
10.1109/SWAT.1973.13.
. . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 9 / 25
10. Backward Construction of Suffix Tree
1 2 3 4
banana$ banana$ anana$ banana$ anana$ nana$ banana$ ana nana$
na$ $
7 6 5
banana$ a na $ banana$ a na banana$ ana na
na $ na$ $ na $ na$ $ na$ $ na$ $
na$ $ na$ $
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 10 / 25
11. Construct Suffix Tree by Sorting Suffix
Suffix : Sorted suffix : Tree of sorted suffix :
mississippi i | -i - >| -
ississippi ippi | | - ppi
ssissippi issippi | | - ssi - >| - ppi
sissippi ississippi | | - ssippi
issippi mississippi | - mississippi
ssippi pi | -p - >| - i
sippi ppi | | - pi
ippi sippi | -s - >| -i - - >| - ppi
ppi sissippi | | - ssippi
pi ssippi | - si - >| - ppi
i ssissippi | - ssippi
Time complexity will be O(N 2 log N) .
Space complexity will be O(N 2 ) .
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 11 / 25
12. Naïve Algorithm
S UFFIX T REE(string)
1 for i ← 1 to length(string)
2 do U PDATE(treei ) £ Phrase i
U PDATE(treei )
1 for j ← 1 to i
2 do node ← treei .F IND(suffix[j to i − 1])
3 E XTEND(node, string[i]) £ Extension j
Time complexity will be O(N 3 ) .
Space complexity will be O(N 2 ) .
The challenge is to make sure treei is updated to treei+1 efficiently.
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 12 / 25
13. Suffix Extend Cases of Naïve Algorithm
.
Case I If path Sj...i ends at leaf, append a char Si+1 to end of edge into leaf.
Sj...i = . . . na Sj...i+1 = . . . nan
na nan
Case II If path Sj...i ends in the middle of an edge , and next char Si+1 is not equal to
the next char in the edge, split that edge, create a internal node, add a new
edge to a new leaf.
Sj...i = . . . na Sj...i+1 = . . . nay
n
nan na
y
Case III If path Sj...i ends in the middle of an edge , and next char Si+1 is equal to the
next char in the edge, do nothing, extenstion has done.
Sj...i = . . . na Sj...i+1 = . . . nan
nan nan
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 13 / 25
14. Naïve Online Construction of Suffix Tree
1 2 3 4
b ba a ban an n bana ana na
7 6 5
banana$ a na $ banana anana nana banan anan nan
na $ na$ $
na$ $
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 14 / 25
15. Properties of Suffix Tree
.
. Each update will add exactly 1
1
leaf node .
0:6
nr_leaf = N
. Suffix tree is full tree.
2
banana$ a na $
Each internal node has at least 2
children. 1:0 8:6 6:6 10:6
nr_internal < N
nr_node < 2N na $ na$ $
. Worst case Fabonacci word
3
4:6 9:6 3:2 7:6
abaababaabaab
. Suffix is either explicit or implicit.
4
na$ $
Explicit when it ends at a node. 2:1 5:6
Implicit when it ends in the
middle of an edge.
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 15 / 25
16. Part Outline
.
1 What is Suffix Tree
.
2 History & Naïve Algorithm
.
3 Optimization of Naïve Algorithm
.
4 Examples & Analysis
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 16 / 25
17. Optimization of Naïve Algorithm
.
. Substrings can be represented as (start, end) pair of their index in
1
orignal string.
Reduce space complexity to O(N) if size of alphabet is fixed constant.
. Once a leaf, Always a leaf
2
Represent edge that links to a leaf as (start, · · · ).
Extend leaf nodes for free. We do not need Extend Case I.
0:6
0:6
banana$ a na $ (1,2) (0,...) (6,...) (2,4)
a banana$... $... na
1:0 8:6 6:6 10:6 8:6 1:0 10:6 6:6
(6,...) (2,4) (6,...) (4,...)
na $ na$ $ $... na $... na$...
4:6 9:6 3:2 7:6 9:6 4:6 7:6 3:2
(6,...) (4,...)
na$ $ $... na$...
5:6 2:1
2:1 5:6 . . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 17 / 25
18. Active Point
During a phrase, if we meet Extend Case III, that is if we found
S[i + 1] already exists in suffix[j . . . i] then S[i + 1] will exists in
∀suffix[k . . . i], k ∈ j . . . i.
Thus Case III is a sign that means update of this phrase is finished.
During phrase i if we stopped at suffix[k . . . i] by Case III, then in
next phrase we can start from suffix[k . . . i + 1] because all suffix
start with 1 . . . k − 1 will end at Case I.
We called this point(current internal node, current position k in
string) as Active Point.
a ab aba abaa abaab abaaba abaabab abaababa abaababaa abaababaab abaababaaba abaababaabaa abaababaaba ab
0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9 0:10 0:11 0:12
(0,...) (0,...) (1,...) (0,...) (1,...) (0,1) (1,...) (0,1) (1,...) (0,1) (1,...) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3)
a... ab... b... aba... ba... a baa... a baab... a baaba... a ba a ba a ba a ba a ba a ba a ba
1:0 1:0 2:1 1:0 2:1 3:3 2:1 3:3 2:1 3:3 2:1 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6
(3,...) (1,...) (3,...) (1,...) (3,...) (1,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,6) (1,3) (3,6) (6,...) (3,6) (1,3) (3,6) (6,...)
a... baa... ab... baab... aba... baaba... abab... ba abab... b... ababa... ba ababa... ba... ababaa... ba ababaa... baa... ababaab... ba ababaab... baab... ababaaba... ba ababaaba... baaba... aba ba aba baabaa... aba ba aba baabaab...
4:3 1:0 4:3 1:0 4:3 1:0 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 13:11 5:6 11:11 8:6 13:11 5:6 11:11 8:6
(3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (11,...) (6,...) (3,6) (6,...) (11,...) (6,...) (11,...) (6,...) (3,6) (6,...) (11,...) (6,...)
abab... b... ababa... ba... ababaa... baa... ababaab... baab... ababaaba... baaba... a... baabaa... aba baabaa... a... baabaa... ab... baabaab... aba baabaab... ab... baabaab...
1:0 6:6 1:0 6:6 1:0 6:6 1:0 6:6 1:0 6:6 14:11 4:3 9:11 6:6 12:11 2:1 14:11 4:3 9:11 6:6 12:11 2:1
(11,...) (6,...) (11,...) (6,...)
a... baabaa... ab... baabaab...
10:11 1:0 10:11 1:0
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 18 / 25
19. Ukk’s Update using Active Point
U PDATE(treei )
1 current_suffix ← active_point
2 next_char ← string[i]
3 while True
4 do if there exists edge start with next_char
5 then break £ Case III
6 else
7 split current edge if implicit
8 create new leaf with new edge labelled next_char
9 if current_suffix is empty
10 then break
11 else current_suffix ← next shorter suffix
12 active_point ← current_suffix
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 19 / 25
20. Suffix Link to find next shorter suffix
Suffix link
Internal node of suffix X α has a link to node α .
If α is empty, suffix link points to root.
How to create suffix link
Link together every internal nodes that are created by splitting in
same phrase.
banana$ banana$
0:6 0:6
(1,2) (0,...) (6,...) (2,4) (0,...) (1,2) (6,...) (2,4)
a banana$... $... na banana$... a $... na
8:6 1:0 10:6 6:6 1:0 8:6 10:6 6:6
(6,...) (2,4) (6,...) (4,...) (6,...) (2,4) (6,...) (4,...)
$... na $... na$... $... na $... na$...
9:6 4:6 7:6 3:2 9:6 4:6 7:6 3:2
(6,...) (4,...) (6,...) (4,...)
$... na$... $... na$...
5:6 2:1 5:6 2:1
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 20 / 25
21. Fast jump using Suffix Link
Assume we are at Suffix X αβ , whose parent internal node
represent X α .
1. Go back to parent internal node,
2. Jumping follow the node’s suffix link to the node represent Suffix
α
3. Go down to Suffix αβ .
Even jump down in step 3 because we already know length of β .
(Skip/Count trick)
Combining all these tricks we can Extend a character in O(1)
time.
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 21 / 25
22. Part Outline
.
1 What is Suffix Tree
.
2 History & Naïve Algorithm
.
3 Optimization of Naïve Algorithm
.
4 Examples & Analysis
. . . . . .
jc-yang (Osaka-U) SuffixTree November 12, 2011 22 / 25