SlideShare une entreprise Scribd logo
1  sur  36
Télécharger pour lire hors ligne
.
                        Algorithm of Suffix Tree
                           by farseerfc@gmail.com
.

                                   Jiachen Yang1
                           1 Research   Student in Osaka University


                               November 12, 2011




                                                                .     .   .        .      .       .

    jc-yang (Osaka-U)                       SuffixTree                         November 12, 2011       1 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .        .      .       .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011       2 / 25
What can we do with suffix tree?
.
                                           Find properties of the strings
Linear algorithms for exact string           . Find the longest common substrings of the
                                             1
matching
                                               string Si and Sj in θ (ni + nj ) time .
    like KMP                                 . Find all maximal pairs, maximal repeats or
                                             2

Search for strings                             supermaximal repeats in θ (n + z) time, if there
  . Check if a string P of length m is
  1                                            are z such repeats .
                                             . Find the Lempel-Ziv decomposition in θ (n)
                                             3
    a substring in O(m) time.
  . Find all z occurrences of the
  2                                            time .
                                             . Find the longest repeated substrings in θ (n)
    patterns P1 , · · · , Pq of total        4


    length m as substrings in                  time.
                                             . Find the most frequently occurring substrings
                                             5
    O(m + z) time.
  . Search for a regular expression P
  3                                            of a minimum length in θ (n) time.
                                             . Find the shortest strings from ∑ that do not
                                             6
    in time expected sublinear in n .
  . Find for each suffix of a pattern
  4                                            occur in D, in O(n + z) time, if there are z such
    P, the length of the longest               strings.
                                             . Find the shortest substrings occurring only
                                             7
    match between a prefix of
    P[i . . . m] and a substring in D in       once in θ (n) time.
                                             . Find, for each i, the shortest substrings of Si
    θ (m) time .                             8

  . ...
  5                                            not occurring elsewhere in D in θ (n) time.
                                             . ...
                                             9


                                                               .     .     .        .      .       .

      jc-yang (Osaka-U)                     SuffixTree                          November 12, 2011       3 / 25
Trie, Radix Tree, Suffix Trie & Suffix Tree

         trie1                    is a dictionary tree.
                   (AKA prefix tree)

                             stores a set of words.
                             each node represents a character except that root is empty string.
                             words with common prefix share same parent nodes.
                             minimal deterministic finite automaton that accepts all words.
  radix tree                            is a trie with compressed chain of nodes.
                   (AKA patricia trie or radix trie)

                             Each internal node has at least 2 children.
  suffix trie is a trie which stores all suffix of a given string.
  suffix tree is a suffix radix tree.
                   that enables linear time construction and fast
                   algorithms of other problems on a string.
   1 pronounced as in word retrieval by its inventor, /tri:/ “tree”, but
pronounced /traI/ “try” by other authors                           .   .    .        .      .       .

      jc-yang (Osaka-U)                                SuffixTree                November 12, 2011       4 / 25
Trie & Radix Tree

.
Trie of “A to tea ted ten i
in inn”
.
                                 .
                                 Radix tree example
                                 .




                                 .

.
                                          .   .   .        .      .       .

    jc-yang (Osaka-U)         SuffixTree               November 12, 2011       5 / 25
Suffix Trie & Suffix Tree of “banana”
                   ^




     b        a          n       $
                                                                                0:6


 a       n         $         a                                  banana$ a             na         $


                                                      1:0             8:6                 6:6            10:6
 n       a                   n       $

                                                                na     $                   na$       $
 a       n         $         a
                                                      4:6             9:6                 3:2            7:6

 n       a                   $
                                                      na$   $


 a       $
                                                2:1         5:6



 $


                                                                  .         .         .          .         .    .

     jc-yang (Osaka-U)                   SuffixTree                                         November 12, 2011        6 / 25
Suffix Tree of “mississippi”
                                                    mississippi$

                                                            0:11


                            (1,2)                     (0,...) (8,9)            (11,...)      (2,3)
                              i                     mississi... p                $...          s


                  12:8                        1:0          15:10                  18:11              4:4


            (8,...) (2,5)          (11,...)                   (10,...)         (9,...)
            ppi$... ssi              $...                       i$...          pi$...

                                                                                             (3,5)           (4,5)
  13:8           6:8                 17:11                 16:10                  14:8
                                                                                               si              i

                (8,...)       (5,...)
                ppi$...     ssippi$...


          7:8                2:1                                         8:8                                 10:8


                                                               (8,...) (5,...)                           (8,...)   (5,...)
                                                               ppi$... ssippi$...                        ppi$... ssippi$...


                                                     9:8                 3:2              11:8                  5:4



                                                                                                     .           .            .        .      .       .

     jc-yang (Osaka-U)                                             SuffixTree                                                      November 12, 2011       7 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .        .      .       .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011       8 / 25
History of Suffix Tree Algorithms
.
                                                           M. Farach. Optimal suffix tree
    First linear algorithm was introduced by Weiner            construction with large
                                                               alphabets. In focs, page
    1973 as position tree. Awarded by Donald Knuth             137. Published by the IEEE
                                                               Computer Society, 1997.
    as “Algorithm of the year 1973”.                       E. M. McCreight. A
                                                               space-economical suffix
    Greatly simplified by McCreight 1976.                       tree construction algorithm.
                                                               J. ACM, 23:262–272, April
Above two algorithms are processing string backward.           1976. ISSN 0004-5411.
                                                               doi: http://doi.acm.org/10.
                                                               1145/321941.321946. URL
    First online construction by Ukkonen 1995, which           http:
                                                               //doi.acm.org/10.
    is easier to understand.                                   1145/321941.321946.
                                                           E. Ukkonen. On-line
                                                               construction of suffix trees.
Above algorithms assume size of alphabet as fixed               Algorithmica, 14:249–260,
                                                               1995. ISSN 0178-4617.
constant .                                                     URL
                                                               http://dx.doi.org/10.
    Limitation was break by Farach 1997, optimal for           1007/BF01206331.
                                                               10.1007/BF01206331.
    all alphabets.                                         P. Weiner. Linear pattern
                                                               matching algorithms. In
                                                               Switching and Automata
Further study are continued to scale to scenarios when         Theory, 1973. SWAT ’08.
                                                               IEEE Conference Record of
the whole suffix tree or even input string cannot fit into       14th Annual Symposium on,
                                                               pages 1 –11, oct. 1973. doi:
memory.                                           .   .    .
                                                               10.1109/SWAT.1973.13.
                                                                    .        .        .

     jc-yang (Osaka-U)             SuffixTree                   November 12, 2011          9 / 25
Backward Construction of Suffix Tree
   1                       2                        3                                4


  banana$            banana$ anana$      banana$ anana$       nana$        banana$ ana nana$



                                                                                    na$ $




                 7                              6                               5


         banana$ a na          $      banana$       a    na           banana$ ana                na



              na $         na$ $         na $           na$ $             na$ $                na$ $


        na$ $                         na$ $


                                                                      .     .            .         .       .     .

       jc-yang (Osaka-U)                        SuffixTree                                    November 12, 2011   10 / 25
Construct Suffix Tree by Sorting Suffix
Suffix :                  Sorted suffix :          Tree of sorted suffix :
mississippi              i                       | -i - >| -
ississippi               ippi                    |       | - ppi
ssissippi                issippi                 |       | - ssi - >| - ppi
sissippi                 ississippi              |                   | - ssippi
issippi                  mississippi             | - mississippi
ssippi                   pi                      | -p - >| - i
sippi                    ppi                     |       | - pi
ippi                     sippi                   | -s - >| -i - - >| - ppi
ppi                      sissippi                        |         | - ssippi
pi                       ssippi                          | - si - >| - ppi
i                        ssissippi                                 | - ssippi
Time complexity will be O(N 2 log N) .
Space complexity will be O(N 2 ) .
                                                        .   .    .         .       .     .

     jc-yang (Osaka-U)               SuffixTree                       November 12, 2011   11 / 25
Naïve Algorithm

S UFFIX T REE(string)
1 for i ← 1 to length(string)
2       do U PDATE(treei )    £ Phrase i

U PDATE(treei )
1 for j ← 1 to i
2       do node ← treei .F IND(suffix[j to i − 1])
3          E XTEND(node, string[i])       £ Extension j

Time complexity will be O(N 3 ) .
Space complexity will be O(N 2 ) .
The challenge is to make sure treei is updated to treei+1 efficiently.
                                                .   .    .         .       .     .

     jc-yang (Osaka-U)           SuffixTree                   November 12, 2011   12 / 25
Suffix Extend Cases of Naïve Algorithm
.
     Case I If path Sj...i ends at leaf, append a char Si+1 to end of edge into leaf.
         Sj...i = . . . na                                       Sj...i+1 = . . . nan

                    na                                                  nan

    Case II If path Sj...i ends in the middle of an edge , and next char Si+1 is not equal to
              the next char in the edge, split that edge, create a internal node, add a new
              edge to a new leaf.
         Sj...i = . . . na                                        Sj...i+1 = . . . nay
                                                                                         n
                    nan                                                 na
                                                                                         y
    Case III If path Sj...i ends in the middle of an edge , and next char Si+1 is equal to the
               next char in the edge, do nothing, extenstion has done.
          Sj...i = . . . na                                      Sj...i+1 = . . . nan
                    nan                                                 nan

                                                              .     .     .         .        .    .

       jc-yang (Osaka-U)                   SuffixTree                          November 12, 2011   13 / 25
Naïve Online Construction of Suffix Tree
   1                   2            3                                  4


       b           ba      a     ban an n                       bana ana na



               7                     6                             5


       banana$ a na        $   banana anana     nana           banan anan nan



           na $        na$ $



       na$ $


                                                       .   .       .         .       .     .

   jc-yang (Osaka-U)                SuffixTree                          November 12, 2011   14 / 25
Properties of Suffix Tree
.
    . Each update will add exactly 1
    1

      leaf node .
                                                                                          0:6
              nr_leaf = N
    . Suffix tree is full tree.
    2
                                                                           banana$ a              na          $
              Each internal node has at least 2
              children.                                      1:0                    8:6             6:6               10:6
              nr_internal < N
              nr_node < 2N                                                 na        $                  na$       $
    . Worst case Fabonacci word
    3
                                                             4:6                    9:6             3:2               7:6
              abaababaabaab
    . Suffix is either explicit or implicit.
    4
                                                             na$       $

              Explicit when it ends at a node.         2:1             5:6
              Implicit when it ends in the
              middle of an edge.

                                                                   .            .         .         .         .        .

        jc-yang (Osaka-U)                  SuffixTree                                          November 12, 2011         15 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .         .       .     .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011   16 / 25
Optimization of Naïve Algorithm
.
     . Substrings can be represented as (start, end) pair of their index in
     1

       orignal string.
                    Reduce space complexity to O(N) if size of alphabet is fixed constant.
     . Once a leaf, Always a leaf
     2

                    Represent edge that links to a leaf as (start, · · · ).
                    Extend leaf nodes for free. We do not need Extend Case I.
                               0:6
                                                                                                                   0:6


                    banana$ a        na         $                                                    (1,2)        (0,...)  (6,...)           (2,4)
                                                                                                       a        banana$... $...               na

          1:0            8:6          6:6               10:6                      8:6                 1:0                   10:6                    6:6


                                                                                 (6,...)   (2,4)                                                   (6,...)   (4,...)
                    na    $               na$       $                             $...      na                                                      $...     na$...


          4:6            9:6          3:2               7:6                9:6                 4:6                                     7:6                    3:2


                                                                                           (6,...)    (4,...)
          na$   $                                                                           $...      na$...


                                                                                   5:6                  2:1
    2:1         5:6                                                                        .            .            .             .           .              .

           jc-yang (Osaka-U)                                   SuffixTree                                                 November 12, 2011                    17 / 25
Active Point

                                     During a phrase, if we meet Extend Case III, that is if we found
                                     S[i + 1] already exists in suffix[j . . . i] then S[i + 1] will exists in
                                     ∀suffix[k . . . i], k ∈ j . . . i.
                                     Thus Case III is a sign that means update of this phrase is finished.
                                     During phrase i if we stopped at suffix[k . . . i] by Case III, then in
                                     next phrase we can start from suffix[k . . . i + 1] because all suffix
                                     start with 1 . . . k − 1 will end at Case I.
                                     We called this point(current internal node, current position k in
                                     string) as Active Point.
a                    ab                       aba                      abaa                   abaab                    abaaba                                  abaabab                                              abaababa                                              abaababaa                                           abaababaab                                          abaababaaba                                                         abaababaabaa                                                                       abaababaaba ab


0:0               0:1                       0:2                      0:3                      0:4                      0:5                                        0:6                                                   0:7                                                 0:8                                                0:9                                                 0:10                                                                             0:11                                                                          0:12


 (0,...)         (0,...)   (1,...)         (0,...)   (1,...)         (0,1)    (1,...)     (0,1)        (1,...)     (0,1)         (1,...)                       (0,1)      (1,3)                                     (0,1)      (1,3)                                       (0,1)    (1,3)                                     (0,1)     (1,3)                                     (0,1)     (1,3)                                                                   (0,1)       (1,3)                                                            (0,1)         (1,3)
  a...            ab...     b...           aba...     ba...            a      baa...        a         baab...        a          baaba...                         a         ba                                         a         ba                                           a       ba                                         a        ba                                         a        ba                                                                       a          ba                                                                a            ba


1:0        1:0              2:1      1:0              2:1      3:3              2:1     3:3             2:1      3:3             2:1                    3:3               7:6                                 3:3              7:6                                  3:3             7:6                                 3:3             7:6                                 3:3             7:6                                                              3:3                  7:6                                                   3:3                          7:6


                                                                (3,...)       (1,...)    (3,...)       (1,...)    (3,...)     (1,...)             (3,...) (1,3)              (3,...)   (6,...)          (3,...) (1,3)             (3,...)   (6,...)           (3,...) (1,3)             (3,...)   (6,...)       (3,...)  (1,3)            (3,...)     (6,...)       (3,...)  (1,3)             (3,...)     (6,...)                                    (3,6) (1,3)                  (3,6)       (6,...)                             (3,6) (1,3)                        (3,6)         (6,...)
                                                                 a...         baa...      ab...       baab...     aba...     baaba...            abab... ba                 abab...     b...           ababa... ba               ababa...    ba...          ababaa... ba              ababaa...   baa...      ababaab... ba             ababaab...   baab...     ababaaba... ba             ababaaba...   baaba...                                     aba   ba                     aba       baabaa...                             aba ba                             aba        baabaab...


                                                               4:3              1:0     4:3             1:0      4:3             1:0       4:3          5:6               2:1          8:6       4:3          5:6              2:1          8:6       4:3           5:6              2:1          8:6       4:3          5:6                2:1       8:6       4:3         5:6                 2:1       8:6                        13:11                   5:6                11:11           8:6                   13:11             5:6                         11:11           8:6


                                                                                                                                                     (3,...)    (6,...)                                    (3,...) (6,...)                                       (3,...) (6,...)                                     (3,...)   (6,...)                                    (3,...)   (6,...)                                      (11,...) (6,...)         (3,6)      (6,...)       (11,...)     (6,...)            (11,...) (6,...)  (3,6)               (6,...)       (11,...)      (6,...)
                                                                                                                                                    abab...      b...                                     ababa... ba...                                       ababaa... baa...                                    ababaab... baab...                                  ababaaba... baaba...                                        a... baabaa...          aba      baabaa...        a...      baabaa...            ab... baabaab... aba               baabaab...       ab...      baabaab...


                                                                                                                                                 1:0              6:6                                  1:0              6:6                                   1:0             6:6                                 1:0                 6:6                             1:0                 6:6                            14:11        4:3            9:11            6:6           12:11           2:1     14:11         4:3           9:11              6:6           12:11           2:1


                                                                                                                                                                                                                                                                                                                                                                                                                                                (11,...)      (6,...)                                                              (11,...)     (6,...)
                                                                                                                                                                                                                                                                                                                                                                                                                                                  a...       baabaa...                                                              ab...     baabaab...


                                                                                                                                                                                                                                                                                                                                                                                                                                             10:11            1:0                                                              10:11             1:0




                                                                                                                                                                                                                                                                                                                                                                            .                                        .                                .                                        .                               .                                            .

                                     jc-yang (Osaka-U)                                                                                                                                                                                                       SuffixTree                                                                                                                                                                                      November 12, 2011                                                                                                 18 / 25
Ukk’s Update using Active Point

U PDATE(treei )
 1 current_suffix ← active_point
 2 next_char ← string[i]
 3 while True
 4      do if there exists edge start with next_char
 5              then break       £ Case III
 6              else
 7                   split current edge if implicit
 8                   create new leaf with new edge labelled next_char
 9          if current_suffix is empty
10              then break
11              else current_suffix ← next shorter suffix
12 active_point ← current_suffix
                                              .   .   .         .       .     .

     jc-yang (Osaka-U)          SuffixTree                 November 12, 2011   19 / 25
Suffix Link to find next shorter suffix
            Suffix link
                             Internal node of suffix X α has a link to node α .
                             If α is empty, suffix link points to root.
            How to create suffix link
                             Link together every internal nodes that are created by splitting in
                             same phrase.
                            banana$                                                     banana$

                             0:6                                                        0:6


            (1,2) (0,...)  (6,...)               (2,4)                  (0,...)  (1,2)             (6,...)     (2,4)
              a banana$... $...                   na                  banana$...   a                $...        na


 8:6                 1:0               10:6         6:6         1:0              8:6               10:6            6:6


  (6,...)         (2,4)                       (6,...) (4,...)            (6,...) (2,4)                       (6,...) (4,...)
   $...            na                          $... na$...                $... na                             $... na$...


 9:6                 4:6               7:6          3:2         9:6              4:6               7:6             3:2


                  (6,...)    (4,...)                                          (6,...)    (4,...)
                   $...      na$...                                            $...      na$...


            5:6               2:1                                       5:6               2:1
                                                                                                                               .   .   .         .       .     .

             jc-yang (Osaka-U)                                                                SuffixTree                                    November 12, 2011   20 / 25
Fast jump using Suffix Link


   Assume we are at Suffix X αβ , whose parent internal node
   represent X α .
      1. Go back to parent internal node,
      2. Jumping follow the node’s suffix link to the node represent Suffix
         α
      3. Go down to Suffix αβ .

   Even jump down in step 3 because we already know length of β .
   (Skip/Count trick)
   Combining all these tricks we can Extend a character in O(1)
   time.


                                                .    .   .         .       .     .

   jc-yang (Osaka-U)             SuffixTree                   November 12, 2011   21 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .         .       .     .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011   22 / 25
Experiment – mississippi
.

                        m

                        0:0


                         (0,...)
                          m...


                        1:0




                                               .   .   .         .       .     .

    jc-yang (Osaka-U)              SuffixTree               November 12, 2011   23 / 25
Experiment – mississippi
.
                             mi

                         0:1


                        (1,...)   (0,...)
                          i...     mi...


               2:1                    1:0




                                               .   .   .         .       .     .

    jc-yang (Osaka-U)              SuffixTree               November 12, 2011   23 / 25
Experiment – mississippi
.
                             mis

                              0:2


                        (1,...) (2,...)        (0,...)
                         is... s...            mis...


        2:1                   3:2                     1:0




                                                            .   .   .         .       .     .

    jc-yang (Osaka-U)                     SuffixTree                     November 12, 2011   23 / 25
Experiment – mississippi
.
                            miss

                              0:3


                        (1,...) (2,...)        (0,...)
                        iss... ss...           miss...


        2:1                   3:2                     1:0




                                                            .   .   .         .       .     .

    jc-yang (Osaka-U)                     SuffixTree                     November 12, 2011   23 / 25
Experiment – mississippi
.
                             miss i

                              0:4


                        (1,...) (0,...)        (2,3)
                        issi... missi...         s


         2:1                  1:0                      4:4


                                               (4,...) (3,...)
                                                 i...   si...


                             5:4                       3:2




                                                                 .   .   .         .       .     .

    jc-yang (Osaka-U)                      SuffixTree                         November 12, 2011   23 / 25
Experiment – mississippi
.
                              miss is

                              0:5


                         (1,...) (0,...)          (2,3)
                        issis... missis...          s


         2:1                   1:0                        4:4


                                                 (4,...) (3,...)
                                                  is... sis...


                              5:4                         3:2




                                                                   .   .   .         .       .     .

    jc-yang (Osaka-U)                        SuffixTree                         November 12, 2011   23 / 25
Experiment – mississippi
.
                               miss iss

                             0:6


                         (1,...)   (0,...)          (2,3)
                        ississ... mississ...          s


         2:1                       1:0                      4:4


                                                  (4,...) (3,...)
                                                  iss... siss...


                                5:4                        3:2




                                                                    .   .   .         .       .     .

    jc-yang (Osaka-U)                          SuffixTree                        November 12, 2011   23 / 25
Experiment – mississippi
.
                                mississi

                              0:7


                          (1,...)    (0,...)           (2,3)
                        ississi... mississi...           s


         2:1                        1:0                       4:4


                                                    (4,...) (3,...)
                                                    issi... sissi...


                                 5:4                         3:2




                                                                       .   .   .         .       .     .

    jc-yang (Osaka-U)                            SuffixTree                         November 12, 2011   23 / 25
Experiment – mississippi
.
                                               mississip

                                       0:8


                        (8,...) (1,2)                   (0,...)             (2,3)
                         p...     i                   mississi...             s


    14:8                     12:8                           1:0                      4:4


                   (8,...)     (2,5)
                    p...        ssi

                                                                            (3,5)             (4,5)
    13:8                     6:8
                                                                              si                i

                   (8,...)     (5,...)
                    p...       ssip...


     7:8                     2:1                            8:8                               10:8


                                          (8,...)     (5,...)                       (8,...)      (5,...)
                                           p...       ssip...                        p...        ssip...


           9:8                          3:2                         11:8                       5:4




                                                                                                           .   .   .         .       .     .

    jc-yang (Osaka-U)                                                  SuffixTree                                       November 12, 2011   23 / 25
Experiment – mississippi
.
                                              mississip p

                                       0:9


                        (8,...) (1,2)                   (0,...)             (2,3)
                        pp...     i                   mississi...             s


    14:8                     12:8                           1:0                      4:4


                   (8,...)     (2,5)
                   pp...        ssi

                                                                            (3,5)             (4,5)
    13:8                     6:8
                                                                              si                i

                   (8,...)      (5,...)
                   pp...       ssipp...


     7:8                     2:1                            8:8                               10:8


                                          (8,...)      (5,...)                      (8,...)       (5,...)
                                          pp...       ssipp...                      pp...        ssipp...


           9:8                          3:2                         11:8                       5:4




                                                                                                            .   .   .         .       .     .

    jc-yang (Osaka-U)                                                  SuffixTree                                        November 12, 2011   23 / 25
Experiment – mississippi
.
                                                       mississippi

                                                        0:10


                                              (1,2)       (0,...)         (8,9)             (2,3)
                                                i       mississi...         p                 s


                             12:8              1:0                       15:10                    4:4


               (2,5)                (8,...)                           (9,...) (10,...)
                ssi                 ppi...                             pi...    i...

                                                                                         (3,5)          (4,5)
    6:8                            13:8               14:8                   16:10
                                                                                           si             i

     (8,...)            (5,...)
     ppi...            ssippi...


    7:8                     2:1                                        8:8                                 10:8


                                                                (8,...) (5,...)                         (8,...)    (5,...)
                                                                ppi... ssippi...                        ppi...    ssippi...


                                              9:8                      3:2                 11:8                   5:4




                                                                                                                        .     .   .         .       .     .

      jc-yang (Osaka-U)                                                              SuffixTree                                        November 12, 2011   23 / 25
Experiment – mississippi
.
                                                        mississippi$

                                                                0:11


                                (1,2)                     (0,...) (8,9)            (11,...)      (2,3)
                                  i                     mississi... p                $...          s


                      12:8                        1:0          15:10                  18:11              4:4


                (8,...) (2,5)          (11,...)                   (10,...)         (9,...)
                ppi$... ssi              $...                       i$...          pi$...

                                                                                                 (3,5)         (4,5)
    13:8             6:8                 17:11                 16:10                  14:8
                                                                                                   si            i

                    (8,...)       (5,...)
                    ppi$...     ssippi$...


              7:8                2:1                                         8:8                               10:8


                                                                   (8,...) (5,...)                       (8,...)   (5,...)
                                                                   ppi$... ssippi$...                    ppi$... ssippi$...


                                                         9:8                 3:2              11:8                5:4




                                                                                                                        .     .   .         .       .     .

           jc-yang (Osaka-U)                                                         SuffixTree                                        November 12, 2011   23 / 25
Time Complexity Analysis
$
i
p
p
i
s
s
i
s
s
i
m
 .   m        i          s   s   i   s     s         i   p       p       i          $
Time complexity is 2N = O(N) .
                                                             .       .       .          .      .     .

     jc-yang (Osaka-U)                   SuffixTree                               November 12, 2011   24 / 25
Experiment – English text
.

                           16
                                                                "data" using 1:3
                           14
Construction Time (sec.)




                           12

                           10

                           8

                           6

                           4

                           2

                           0
                                0      20000 40000   60000 80000 100000 120000 140000
                                                     File size (byte)
                                                                             .     .   .         .       .     .

                           jc-yang (Osaka-U)               SuffixTree                       November 12, 2011   25 / 25

Contenu connexe

Plus de Jiachen Yang

データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性Jiachen Yang
 
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...Jiachen Yang
 
チェックリストと分割に基づく 網羅と使用テスト
チェックリストと分割に基づく  網羅と使用テストチェックリストと分割に基づく  網羅と使用テスト
チェックリストと分割に基づく 網羅と使用テストJiachen Yang
 
Active Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly ReportsActive Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly ReportsJiachen Yang
 
Inference and Checking of Object Ownership
Inference  and  Checking  of  Object OwnershipInference  and  Checking  of  Object Ownership
Inference and Checking of Object OwnershipJiachen Yang
 
基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究Jiachen Yang
 
Output fica.beamer.43
Output fica.beamer.43Output fica.beamer.43
Output fica.beamer.43Jiachen Yang
 

Plus de Jiachen Yang (8)

データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性
 
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
 
チェックリストと分割に基づく 網羅と使用テスト
チェックリストと分割に基づく  網羅と使用テストチェックリストと分割に基づく  網羅と使用テスト
チェックリストと分割に基づく 網羅と使用テスト
 
Active Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly ReportsActive Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly Reports
 
Inference and Checking of Object Ownership
Inference  and  Checking  of  Object OwnershipInference  and  Checking  of  Object Ownership
Inference and Checking of Object Ownership
 
基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究
 
Output fica.beamer.43
Output fica.beamer.43Output fica.beamer.43
Output fica.beamer.43
 
Cloud sim report
Cloud sim reportCloud sim report
Cloud sim report
 

Dernier

Gamestore case study UI UX by Amgad Ibrahim
Gamestore case study UI UX by Amgad IbrahimGamestore case study UI UX by Amgad Ibrahim
Gamestore case study UI UX by Amgad Ibrahimamgadibrahim92
 
Call Girls In Ratnagiri Escorts ☎️8617370543 🔝 💃 Enjoy 24/7 Escort Service En...
Call Girls In Ratnagiri Escorts ☎️8617370543 🔝 💃 Enjoy 24/7 Escort Service En...Call Girls In Ratnagiri Escorts ☎️8617370543 🔝 💃 Enjoy 24/7 Escort Service En...
Call Girls In Ratnagiri Escorts ☎️8617370543 🔝 💃 Enjoy 24/7 Escort Service En...Nitya salvi
 
How to Create a Productive Workspace Trends and Tips.pdf
How to Create a Productive Workspace Trends and Tips.pdfHow to Create a Productive Workspace Trends and Tips.pdf
How to Create a Productive Workspace Trends and Tips.pdfOffice Furniture Plus - Irving
 
一比一定(购)卡尔顿大学毕业证(CU毕业证)成绩单学位证
一比一定(购)卡尔顿大学毕业证(CU毕业证)成绩单学位证一比一定(购)卡尔顿大学毕业证(CU毕业证)成绩单学位证
一比一定(购)卡尔顿大学毕业证(CU毕业证)成绩单学位证wpkuukw
 
Abortion pills in Riyadh +966572737505 <> buy cytotec <> unwanted kit Saudi A...
Abortion pills in Riyadh +966572737505 <> buy cytotec <> unwanted kit Saudi A...Abortion pills in Riyadh +966572737505 <> buy cytotec <> unwanted kit Saudi A...
Abortion pills in Riyadh +966572737505 <> buy cytotec <> unwanted kit Saudi A...samsungultra782445
 
怎样办理巴斯大学毕业证(Bath毕业证书)成绩单留信认证
怎样办理巴斯大学毕业证(Bath毕业证书)成绩单留信认证怎样办理巴斯大学毕业证(Bath毕业证书)成绩单留信认证
怎样办理巴斯大学毕业证(Bath毕业证书)成绩单留信认证eeanqy
 
一比一定(购)滑铁卢大学毕业证(UW毕业证)成绩单学位证
一比一定(购)滑铁卢大学毕业证(UW毕业证)成绩单学位证一比一定(购)滑铁卢大学毕业证(UW毕业证)成绩单学位证
一比一定(购)滑铁卢大学毕业证(UW毕业证)成绩单学位证wpkuukw
 
Q4-Trends-Networks-Module-3.pdfqquater days sheets123456789
Q4-Trends-Networks-Module-3.pdfqquater days sheets123456789Q4-Trends-Networks-Module-3.pdfqquater days sheets123456789
Q4-Trends-Networks-Module-3.pdfqquater days sheets123456789CristineGraceAcuyan
 
怎样办理伯明翰大学学院毕业证(Birmingham毕业证书)成绩单留信认证
怎样办理伯明翰大学学院毕业证(Birmingham毕业证书)成绩单留信认证怎样办理伯明翰大学学院毕业证(Birmingham毕业证书)成绩单留信认证
怎样办理伯明翰大学学院毕业证(Birmingham毕业证书)成绩单留信认证eeanqy
 
Furniture & Joinery Details_Designs.pptx
Furniture & Joinery Details_Designs.pptxFurniture & Joinery Details_Designs.pptx
Furniture & Joinery Details_Designs.pptxNikhil Raut
 
Abortion pills in Kuwait 🚚+966505195917 but home delivery available in Kuwait...
Abortion pills in Kuwait 🚚+966505195917 but home delivery available in Kuwait...Abortion pills in Kuwait 🚚+966505195917 but home delivery available in Kuwait...
Abortion pills in Kuwait 🚚+966505195917 but home delivery available in Kuwait...drmarathore
 
一比一原版(ANU毕业证书)澳大利亚国立大学毕业证原件一模一样
一比一原版(ANU毕业证书)澳大利亚国立大学毕业证原件一模一样一比一原版(ANU毕业证书)澳大利亚国立大学毕业证原件一模一样
一比一原版(ANU毕业证书)澳大利亚国立大学毕业证原件一模一样yhavx
 
ab-initio-training basics and architecture
ab-initio-training basics and architectureab-initio-training basics and architecture
ab-initio-training basics and architecturesaipriyacoool
 
Just Call Vip call girls Kasganj Escorts ☎️8617370543 Two shot with one girl ...
Just Call Vip call girls Kasganj Escorts ☎️8617370543 Two shot with one girl ...Just Call Vip call girls Kasganj Escorts ☎️8617370543 Two shot with one girl ...
Just Call Vip call girls Kasganj Escorts ☎️8617370543 Two shot with one girl ...Nitya salvi
 
Top profile Call Girls In eluru [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In eluru [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In eluru [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In eluru [ 7014168258 ] Call Me For Genuine Models We ...gajnagarg
 
Raebareli Girl Whatsapp Number 📞 8617370543 | Girls Number for Friendship
Raebareli Girl Whatsapp Number 📞 8617370543 | Girls Number for FriendshipRaebareli Girl Whatsapp Number 📞 8617370543 | Girls Number for Friendship
Raebareli Girl Whatsapp Number 📞 8617370543 | Girls Number for FriendshipNitya salvi
 
一比一原版(WLU毕业证)罗瑞尔大学毕业证成绩单留信学历认证原版一模一样
一比一原版(WLU毕业证)罗瑞尔大学毕业证成绩单留信学历认证原版一模一样一比一原版(WLU毕业证)罗瑞尔大学毕业证成绩单留信学历认证原版一模一样
一比一原版(WLU毕业证)罗瑞尔大学毕业证成绩单留信学历认证原版一模一样awasv46j
 
Eye-Catching Web Design Crafting User Interfaces .docx
Eye-Catching Web Design Crafting User Interfaces .docxEye-Catching Web Design Crafting User Interfaces .docx
Eye-Catching Web Design Crafting User Interfaces .docxMdBokhtiyarHossainNi
 
Top profile Call Girls In fatehgarh [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In fatehgarh [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In fatehgarh [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In fatehgarh [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 

Dernier (20)

Gamestore case study UI UX by Amgad Ibrahim
Gamestore case study UI UX by Amgad IbrahimGamestore case study UI UX by Amgad Ibrahim
Gamestore case study UI UX by Amgad Ibrahim
 
Call Girls In Ratnagiri Escorts ☎️8617370543 🔝 💃 Enjoy 24/7 Escort Service En...
Call Girls In Ratnagiri Escorts ☎️8617370543 🔝 💃 Enjoy 24/7 Escort Service En...Call Girls In Ratnagiri Escorts ☎️8617370543 🔝 💃 Enjoy 24/7 Escort Service En...
Call Girls In Ratnagiri Escorts ☎️8617370543 🔝 💃 Enjoy 24/7 Escort Service En...
 
How to Create a Productive Workspace Trends and Tips.pdf
How to Create a Productive Workspace Trends and Tips.pdfHow to Create a Productive Workspace Trends and Tips.pdf
How to Create a Productive Workspace Trends and Tips.pdf
 
一比一定(购)卡尔顿大学毕业证(CU毕业证)成绩单学位证
一比一定(购)卡尔顿大学毕业证(CU毕业证)成绩单学位证一比一定(购)卡尔顿大学毕业证(CU毕业证)成绩单学位证
一比一定(购)卡尔顿大学毕业证(CU毕业证)成绩单学位证
 
Abortion pills in Riyadh +966572737505 <> buy cytotec <> unwanted kit Saudi A...
Abortion pills in Riyadh +966572737505 <> buy cytotec <> unwanted kit Saudi A...Abortion pills in Riyadh +966572737505 <> buy cytotec <> unwanted kit Saudi A...
Abortion pills in Riyadh +966572737505 <> buy cytotec <> unwanted kit Saudi A...
 
怎样办理巴斯大学毕业证(Bath毕业证书)成绩单留信认证
怎样办理巴斯大学毕业证(Bath毕业证书)成绩单留信认证怎样办理巴斯大学毕业证(Bath毕业证书)成绩单留信认证
怎样办理巴斯大学毕业证(Bath毕业证书)成绩单留信认证
 
一比一定(购)滑铁卢大学毕业证(UW毕业证)成绩单学位证
一比一定(购)滑铁卢大学毕业证(UW毕业证)成绩单学位证一比一定(购)滑铁卢大学毕业证(UW毕业证)成绩单学位证
一比一定(购)滑铁卢大学毕业证(UW毕业证)成绩单学位证
 
Abortion Pills in Oman (+918133066128) Cytotec clinic buy Oman Muscat
Abortion Pills in Oman (+918133066128) Cytotec clinic buy Oman MuscatAbortion Pills in Oman (+918133066128) Cytotec clinic buy Oman Muscat
Abortion Pills in Oman (+918133066128) Cytotec clinic buy Oman Muscat
 
Q4-Trends-Networks-Module-3.pdfqquater days sheets123456789
Q4-Trends-Networks-Module-3.pdfqquater days sheets123456789Q4-Trends-Networks-Module-3.pdfqquater days sheets123456789
Q4-Trends-Networks-Module-3.pdfqquater days sheets123456789
 
怎样办理伯明翰大学学院毕业证(Birmingham毕业证书)成绩单留信认证
怎样办理伯明翰大学学院毕业证(Birmingham毕业证书)成绩单留信认证怎样办理伯明翰大学学院毕业证(Birmingham毕业证书)成绩单留信认证
怎样办理伯明翰大学学院毕业证(Birmingham毕业证书)成绩单留信认证
 
Furniture & Joinery Details_Designs.pptx
Furniture & Joinery Details_Designs.pptxFurniture & Joinery Details_Designs.pptx
Furniture & Joinery Details_Designs.pptx
 
Abortion pills in Kuwait 🚚+966505195917 but home delivery available in Kuwait...
Abortion pills in Kuwait 🚚+966505195917 but home delivery available in Kuwait...Abortion pills in Kuwait 🚚+966505195917 but home delivery available in Kuwait...
Abortion pills in Kuwait 🚚+966505195917 but home delivery available in Kuwait...
 
一比一原版(ANU毕业证书)澳大利亚国立大学毕业证原件一模一样
一比一原版(ANU毕业证书)澳大利亚国立大学毕业证原件一模一样一比一原版(ANU毕业证书)澳大利亚国立大学毕业证原件一模一样
一比一原版(ANU毕业证书)澳大利亚国立大学毕业证原件一模一样
 
ab-initio-training basics and architecture
ab-initio-training basics and architectureab-initio-training basics and architecture
ab-initio-training basics and architecture
 
Just Call Vip call girls Kasganj Escorts ☎️8617370543 Two shot with one girl ...
Just Call Vip call girls Kasganj Escorts ☎️8617370543 Two shot with one girl ...Just Call Vip call girls Kasganj Escorts ☎️8617370543 Two shot with one girl ...
Just Call Vip call girls Kasganj Escorts ☎️8617370543 Two shot with one girl ...
 
Top profile Call Girls In eluru [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In eluru [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In eluru [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In eluru [ 7014168258 ] Call Me For Genuine Models We ...
 
Raebareli Girl Whatsapp Number 📞 8617370543 | Girls Number for Friendship
Raebareli Girl Whatsapp Number 📞 8617370543 | Girls Number for FriendshipRaebareli Girl Whatsapp Number 📞 8617370543 | Girls Number for Friendship
Raebareli Girl Whatsapp Number 📞 8617370543 | Girls Number for Friendship
 
一比一原版(WLU毕业证)罗瑞尔大学毕业证成绩单留信学历认证原版一模一样
一比一原版(WLU毕业证)罗瑞尔大学毕业证成绩单留信学历认证原版一模一样一比一原版(WLU毕业证)罗瑞尔大学毕业证成绩单留信学历认证原版一模一样
一比一原版(WLU毕业证)罗瑞尔大学毕业证成绩单留信学历认证原版一模一样
 
Eye-Catching Web Design Crafting User Interfaces .docx
Eye-Catching Web Design Crafting User Interfaces .docxEye-Catching Web Design Crafting User Interfaces .docx
Eye-Catching Web Design Crafting User Interfaces .docx
 
Top profile Call Girls In fatehgarh [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In fatehgarh [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In fatehgarh [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In fatehgarh [ 7014168258 ] Call Me For Genuine Models...
 

Ukk's Algorithm of Suffix Tree

  • 1. . Algorithm of Suffix Tree by farseerfc@gmail.com . Jiachen Yang1 1 Research Student in Osaka University November 12, 2011 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 1 / 25
  • 2. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 2 / 25
  • 3. What can we do with suffix tree? . Find properties of the strings Linear algorithms for exact string . Find the longest common substrings of the 1 matching string Si and Sj in θ (ni + nj ) time . like KMP . Find all maximal pairs, maximal repeats or 2 Search for strings supermaximal repeats in θ (n + z) time, if there . Check if a string P of length m is 1 are z such repeats . . Find the Lempel-Ziv decomposition in θ (n) 3 a substring in O(m) time. . Find all z occurrences of the 2 time . . Find the longest repeated substrings in θ (n) patterns P1 , · · · , Pq of total 4 length m as substrings in time. . Find the most frequently occurring substrings 5 O(m + z) time. . Search for a regular expression P 3 of a minimum length in θ (n) time. . Find the shortest strings from ∑ that do not 6 in time expected sublinear in n . . Find for each suffix of a pattern 4 occur in D, in O(n + z) time, if there are z such P, the length of the longest strings. . Find the shortest substrings occurring only 7 match between a prefix of P[i . . . m] and a substring in D in once in θ (n) time. . Find, for each i, the shortest substrings of Si θ (m) time . 8 . ... 5 not occurring elsewhere in D in θ (n) time. . ... 9 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 3 / 25
  • 4. Trie, Radix Tree, Suffix Trie & Suffix Tree trie1 is a dictionary tree. (AKA prefix tree) stores a set of words. each node represents a character except that root is empty string. words with common prefix share same parent nodes. minimal deterministic finite automaton that accepts all words. radix tree is a trie with compressed chain of nodes. (AKA patricia trie or radix trie) Each internal node has at least 2 children. suffix trie is a trie which stores all suffix of a given string. suffix tree is a suffix radix tree. that enables linear time construction and fast algorithms of other problems on a string. 1 pronounced as in word retrieval by its inventor, /tri:/ “tree”, but pronounced /traI/ “try” by other authors . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 4 / 25
  • 5. Trie & Radix Tree . Trie of “A to tea ted ten i in inn” . . Radix tree example . . . . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 5 / 25
  • 6. Suffix Trie & Suffix Tree of “banana” ^ b a n $ 0:6 a n $ a banana$ a na $ 1:0 8:6 6:6 10:6 n a n $ na $ na$ $ a n $ a 4:6 9:6 3:2 7:6 n a $ na$ $ a $ 2:1 5:6 $ . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 6 / 25
  • 7. Suffix Tree of “mississippi” mississippi$ 0:11 (1,2) (0,...) (8,9) (11,...) (2,3) i mississi... p $... s 12:8 1:0 15:10 18:11 4:4 (8,...) (2,5) (11,...) (10,...) (9,...) ppi$... ssi $... i$... pi$... (3,5) (4,5) 13:8 6:8 17:11 16:10 14:8 si i (8,...) (5,...) ppi$... ssippi$... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) ppi$... ssippi$... ppi$... ssippi$... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 7 / 25
  • 8. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 8 / 25
  • 9. History of Suffix Tree Algorithms . M. Farach. Optimal suffix tree First linear algorithm was introduced by Weiner construction with large alphabets. In focs, page 1973 as position tree. Awarded by Donald Knuth 137. Published by the IEEE Computer Society, 1997. as “Algorithm of the year 1973”. E. M. McCreight. A space-economical suffix Greatly simplified by McCreight 1976. tree construction algorithm. J. ACM, 23:262–272, April Above two algorithms are processing string backward. 1976. ISSN 0004-5411. doi: http://doi.acm.org/10. 1145/321941.321946. URL First online construction by Ukkonen 1995, which http: //doi.acm.org/10. is easier to understand. 1145/321941.321946. E. Ukkonen. On-line construction of suffix trees. Above algorithms assume size of alphabet as fixed Algorithmica, 14:249–260, 1995. ISSN 0178-4617. constant . URL http://dx.doi.org/10. Limitation was break by Farach 1997, optimal for 1007/BF01206331. 10.1007/BF01206331. all alphabets. P. Weiner. Linear pattern matching algorithms. In Switching and Automata Further study are continued to scale to scenarios when Theory, 1973. SWAT ’08. IEEE Conference Record of the whole suffix tree or even input string cannot fit into 14th Annual Symposium on, pages 1 –11, oct. 1973. doi: memory. . . . 10.1109/SWAT.1973.13. . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 9 / 25
  • 10. Backward Construction of Suffix Tree 1 2 3 4 banana$ banana$ anana$ banana$ anana$ nana$ banana$ ana nana$ na$ $ 7 6 5 banana$ a na $ banana$ a na banana$ ana na na $ na$ $ na $ na$ $ na$ $ na$ $ na$ $ na$ $ . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 10 / 25
  • 11. Construct Suffix Tree by Sorting Suffix Suffix : Sorted suffix : Tree of sorted suffix : mississippi i | -i - >| - ississippi ippi | | - ppi ssissippi issippi | | - ssi - >| - ppi sissippi ississippi | | - ssippi issippi mississippi | - mississippi ssippi pi | -p - >| - i sippi ppi | | - pi ippi sippi | -s - >| -i - - >| - ppi ppi sissippi | | - ssippi pi ssippi | - si - >| - ppi i ssissippi | - ssippi Time complexity will be O(N 2 log N) . Space complexity will be O(N 2 ) . . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 11 / 25
  • 12. Naïve Algorithm S UFFIX T REE(string) 1 for i ← 1 to length(string) 2 do U PDATE(treei ) £ Phrase i U PDATE(treei ) 1 for j ← 1 to i 2 do node ← treei .F IND(suffix[j to i − 1]) 3 E XTEND(node, string[i]) £ Extension j Time complexity will be O(N 3 ) . Space complexity will be O(N 2 ) . The challenge is to make sure treei is updated to treei+1 efficiently. . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 12 / 25
  • 13. Suffix Extend Cases of Naïve Algorithm . Case I If path Sj...i ends at leaf, append a char Si+1 to end of edge into leaf. Sj...i = . . . na Sj...i+1 = . . . nan na nan Case II If path Sj...i ends in the middle of an edge , and next char Si+1 is not equal to the next char in the edge, split that edge, create a internal node, add a new edge to a new leaf. Sj...i = . . . na Sj...i+1 = . . . nay n nan na y Case III If path Sj...i ends in the middle of an edge , and next char Si+1 is equal to the next char in the edge, do nothing, extenstion has done. Sj...i = . . . na Sj...i+1 = . . . nan nan nan . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 13 / 25
  • 14. Naïve Online Construction of Suffix Tree 1 2 3 4 b ba a ban an n bana ana na 7 6 5 banana$ a na $ banana anana nana banan anan nan na $ na$ $ na$ $ . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 14 / 25
  • 15. Properties of Suffix Tree . . Each update will add exactly 1 1 leaf node . 0:6 nr_leaf = N . Suffix tree is full tree. 2 banana$ a na $ Each internal node has at least 2 children. 1:0 8:6 6:6 10:6 nr_internal < N nr_node < 2N na $ na$ $ . Worst case Fabonacci word 3 4:6 9:6 3:2 7:6 abaababaabaab . Suffix is either explicit or implicit. 4 na$ $ Explicit when it ends at a node. 2:1 5:6 Implicit when it ends in the middle of an edge. . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 15 / 25
  • 16. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 16 / 25
  • 17. Optimization of Naïve Algorithm . . Substrings can be represented as (start, end) pair of their index in 1 orignal string. Reduce space complexity to O(N) if size of alphabet is fixed constant. . Once a leaf, Always a leaf 2 Represent edge that links to a leaf as (start, · · · ). Extend leaf nodes for free. We do not need Extend Case I. 0:6 0:6 banana$ a na $ (1,2) (0,...) (6,...) (2,4) a banana$... $... na 1:0 8:6 6:6 10:6 8:6 1:0 10:6 6:6 (6,...) (2,4) (6,...) (4,...) na $ na$ $ $... na $... na$... 4:6 9:6 3:2 7:6 9:6 4:6 7:6 3:2 (6,...) (4,...) na$ $ $... na$... 5:6 2:1 2:1 5:6 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 17 / 25
  • 18. Active Point During a phrase, if we meet Extend Case III, that is if we found S[i + 1] already exists in suffix[j . . . i] then S[i + 1] will exists in ∀suffix[k . . . i], k ∈ j . . . i. Thus Case III is a sign that means update of this phrase is finished. During phrase i if we stopped at suffix[k . . . i] by Case III, then in next phrase we can start from suffix[k . . . i + 1] because all suffix start with 1 . . . k − 1 will end at Case I. We called this point(current internal node, current position k in string) as Active Point. a ab aba abaa abaab abaaba abaabab abaababa abaababaa abaababaab abaababaaba abaababaabaa abaababaaba ab 0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9 0:10 0:11 0:12 (0,...) (0,...) (1,...) (0,...) (1,...) (0,1) (1,...) (0,1) (1,...) (0,1) (1,...) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) a... ab... b... aba... ba... a baa... a baab... a baaba... a ba a ba a ba a ba a ba a ba a ba 1:0 1:0 2:1 1:0 2:1 3:3 2:1 3:3 2:1 3:3 2:1 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 (3,...) (1,...) (3,...) (1,...) (3,...) (1,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,6) (1,3) (3,6) (6,...) (3,6) (1,3) (3,6) (6,...) a... baa... ab... baab... aba... baaba... abab... ba abab... b... ababa... ba ababa... ba... ababaa... ba ababaa... baa... ababaab... ba ababaab... baab... ababaaba... ba ababaaba... baaba... aba ba aba baabaa... aba ba aba baabaab... 4:3 1:0 4:3 1:0 4:3 1:0 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 13:11 5:6 11:11 8:6 13:11 5:6 11:11 8:6 (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (11,...) (6,...) (3,6) (6,...) (11,...) (6,...) (11,...) (6,...) (3,6) (6,...) (11,...) (6,...) abab... b... ababa... ba... ababaa... baa... ababaab... baab... ababaaba... baaba... a... baabaa... aba baabaa... a... baabaa... ab... baabaab... aba baabaab... ab... baabaab... 1:0 6:6 1:0 6:6 1:0 6:6 1:0 6:6 1:0 6:6 14:11 4:3 9:11 6:6 12:11 2:1 14:11 4:3 9:11 6:6 12:11 2:1 (11,...) (6,...) (11,...) (6,...) a... baabaa... ab... baabaab... 10:11 1:0 10:11 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 18 / 25
  • 19. Ukk’s Update using Active Point U PDATE(treei ) 1 current_suffix ← active_point 2 next_char ← string[i] 3 while True 4 do if there exists edge start with next_char 5 then break £ Case III 6 else 7 split current edge if implicit 8 create new leaf with new edge labelled next_char 9 if current_suffix is empty 10 then break 11 else current_suffix ← next shorter suffix 12 active_point ← current_suffix . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 19 / 25
  • 20. Suffix Link to find next shorter suffix Suffix link Internal node of suffix X α has a link to node α . If α is empty, suffix link points to root. How to create suffix link Link together every internal nodes that are created by splitting in same phrase. banana$ banana$ 0:6 0:6 (1,2) (0,...) (6,...) (2,4) (0,...) (1,2) (6,...) (2,4) a banana$... $... na banana$... a $... na 8:6 1:0 10:6 6:6 1:0 8:6 10:6 6:6 (6,...) (2,4) (6,...) (4,...) (6,...) (2,4) (6,...) (4,...) $... na $... na$... $... na $... na$... 9:6 4:6 7:6 3:2 9:6 4:6 7:6 3:2 (6,...) (4,...) (6,...) (4,...) $... na$... $... na$... 5:6 2:1 5:6 2:1 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 20 / 25
  • 21. Fast jump using Suffix Link Assume we are at Suffix X αβ , whose parent internal node represent X α . 1. Go back to parent internal node, 2. Jumping follow the node’s suffix link to the node represent Suffix α 3. Go down to Suffix αβ . Even jump down in step 3 because we already know length of β . (Skip/Count trick) Combining all these tricks we can Extend a character in O(1) time. . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 21 / 25
  • 22. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 22 / 25
  • 23. Experiment – mississippi . m 0:0 (0,...) m... 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 24. Experiment – mississippi . mi 0:1 (1,...) (0,...) i... mi... 2:1 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 25. Experiment – mississippi . mis 0:2 (1,...) (2,...) (0,...) is... s... mis... 2:1 3:2 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 26. Experiment – mississippi . miss 0:3 (1,...) (2,...) (0,...) iss... ss... miss... 2:1 3:2 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 27. Experiment – mississippi . miss i 0:4 (1,...) (0,...) (2,3) issi... missi... s 2:1 1:0 4:4 (4,...) (3,...) i... si... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 28. Experiment – mississippi . miss is 0:5 (1,...) (0,...) (2,3) issis... missis... s 2:1 1:0 4:4 (4,...) (3,...) is... sis... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 29. Experiment – mississippi . miss iss 0:6 (1,...) (0,...) (2,3) ississ... mississ... s 2:1 1:0 4:4 (4,...) (3,...) iss... siss... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 30. Experiment – mississippi . mississi 0:7 (1,...) (0,...) (2,3) ississi... mississi... s 2:1 1:0 4:4 (4,...) (3,...) issi... sissi... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 31. Experiment – mississippi . mississip 0:8 (8,...) (1,2) (0,...) (2,3) p... i mississi... s 14:8 12:8 1:0 4:4 (8,...) (2,5) p... ssi (3,5) (4,5) 13:8 6:8 si i (8,...) (5,...) p... ssip... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) p... ssip... p... ssip... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 32. Experiment – mississippi . mississip p 0:9 (8,...) (1,2) (0,...) (2,3) pp... i mississi... s 14:8 12:8 1:0 4:4 (8,...) (2,5) pp... ssi (3,5) (4,5) 13:8 6:8 si i (8,...) (5,...) pp... ssipp... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) pp... ssipp... pp... ssipp... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 33. Experiment – mississippi . mississippi 0:10 (1,2) (0,...) (8,9) (2,3) i mississi... p s 12:8 1:0 15:10 4:4 (2,5) (8,...) (9,...) (10,...) ssi ppi... pi... i... (3,5) (4,5) 6:8 13:8 14:8 16:10 si i (8,...) (5,...) ppi... ssippi... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) ppi... ssippi... ppi... ssippi... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 34. Experiment – mississippi . mississippi$ 0:11 (1,2) (0,...) (8,9) (11,...) (2,3) i mississi... p $... s 12:8 1:0 15:10 18:11 4:4 (8,...) (2,5) (11,...) (10,...) (9,...) ppi$... ssi $... i$... pi$... (3,5) (4,5) 13:8 6:8 17:11 16:10 14:8 si i (8,...) (5,...) ppi$... ssippi$... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) ppi$... ssippi$... ppi$... ssippi$... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 35. Time Complexity Analysis $ i p p i s s i s s i m . m i s s i s s i p p i $ Time complexity is 2N = O(N) . . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 24 / 25
  • 36. Experiment – English text . 16 "data" using 1:3 14 Construction Time (sec.) 12 10 8 6 4 2 0 0 20000 40000 60000 80000 100000 120000 140000 File size (byte) . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 25 / 25