SlideShare une entreprise Scribd logo
1  sur  71
2011/12/1 ALSIP2011@Takamatsu




          Succinct Data Structure for
        Analyzing Document Collection


                Preferred Infrastructure Inc.
                   Daisuke Okanohara
                    hillbig@preferred.jp
Agenda

   Succinct Data Structures (20 min.)
                                              Succinct data        Full-text
        Rank/Select, Wavelet Tree
                                               Structures           Index
   Full-text Index (30 min.)
        Suffix Array/Tree, BWT                      Compressed Full-
                                                        text Index
   Compressed Full-Text Index (20 min.)
        FM-index                                   Analyzing Document
   Analyzing Document Collection (20 min.)              Collection
        Document Statistics (for mining)
        Top-k query
Background

   Very large document collections appear everywhere
        Twitter : 200 billion tweets per year. (6K tweets per sec.)
        Modern DNA Sequencing
             E.g. Illumina HiSeq 2000 produces 40G bases per day


   Analyzing these large document collections require new data structures
    and algorithms
Notation

   T[0, n] : string T[0]T[1]…T[n]
   T[i, j] : subarray from T[i] to T[j] inclusively
   T[i, j) : subarray from T[i] to T[j-1]

   lg m : log2 m          (# of bits to represent m cases)

   Σ : the alphabet set of string
   σ := |Σ| : the number of alphabets
   n : length of input string
   m : length of query string
Preliminaries
Empirical Entropy (1/2)
   Empirical entropy indicates the compressibility of the string
   Zero-order empirical entropy of T
    H0(T) := ∑c n(c)/n log (n/p(c))
        n(c) : the number of occurrences of c in T
        n : the length of T
        The lower-bound for the average code length of compressor not using
         context information


T = abccbaabcab,
 n(a)=4,
 n(b)=4,
 n(c)=3, H0(T) = 1.57 bit
Empirical Entropy (2/2)

   k-th order empirical entropy of T
   Hk(T) = ∑w∈Σk-1 nw/n H0(Tw) :
        Tw : the characters followed by a substring w
        nw : the number of occurrences of a string w in T.
        The lower-bound for the average code length of compressor using the
         following substring of length k-1 as a context information


   T = abccbaabcab
    Ta=bac Tb=acaa Tc=bcb, H2(T) = 0.33 bit

   Hierarchy of entropy: log σ ≧ H0(T) ≧ H1(T) ≧ … Hk(T)
        Typical English Text, lg σ=8 H0=4.5 , H5=1.7
Succinct Data Structure
Rank/Select Dictionary

For a bit vector B[0, n), B[i] ∈{0, 1}, consider following operations
 Lookupc(B, i) : B[i]
 Rankc(B, i) : The number of c’s in B[0…i)
 Selectc(B, i) : The (i+1)-th occurrence of c in B


Example:
i 0123456789
B 0111011001

    Rank1(B,6)=4                Select0(B,2)=2
The number of 1’s in B[0, 6)    The (2+1)-th occurrence of 0 in B
Rank/Select Dictionary Usage (1/2)
Subset
   For U = {0, 1, 2, …, n-1}, a subset of U, V⊆U can be represented by a bit
    vector B[0, n-1], s.t. B[i] = 1 if i∈V, and B[i]=0 otherwise.
   Select(B, i) is the (i+1)-th smallest element in V,
   Rank(B, i) = |{e|e∈V, e<i}|



    V = {0, 4, 7, 10, 15}

    i   0123456789012345
    B = 1000100100100001
Rank/Select Dictionary Usage (2/2)
Partial Sums
   For a non-negative integer array P[0, m), we concatenate each unary
    codes of P[i] to build a bit array B[0, n)
        An unary code of a non-negative integer p≧0 is 0p1
             E.g. unary codes of 0, 3, 10 are 1, 0001, 00000000001
        From the definition, n=m+∑p[i] and the number of ones is m
   P[i] = Select1(B, i) - Select1(B, i-1).
   Cum[k] = ∑i=0…kP[i]. = Select1(B, k) - k (# of 0’s before (k+1)-th 1)
   Find r s.t. Cum[r] ≦ x < Cum[r+1] by Rank1(B, Select0(B, x))

P = {3, 5, 2, 0, 4, 7}
B = 000100000100110000100000001
Cum[4] = Select(B, 4) – 4 = 14
Data structures for Rank/Select

   Bit vector B[0, n) with m ones can be represented in lg (n, m) bits
        lg (n, m) ≒ m lg2e + m lg (n/m) when m << n
   The implementations of Rank/Select Dictionary are divided into the case
    when the bit vector is dense, sparse, and very sparse.

   Case 1. Dense                          01101011010100110101100101
   Case 2. Sparse (RRR)                    00001001000010001000100010
   Case 3. Very Sparse (SDArray)           00000000000010000000001000
Rank/Select Dictionary
Case 1. Dense
   Divide B into large blocks of length l=(log n)2
   Divide each large block into small blocks of length s=(log n)/2
   For each large block, store L[i]=Rank1(B, il)
   For each small block, store S[i]=Rank1(B, is) - Rank1(B, is / l)
   Build a table P(X) that stores the number of ones in X of length s

   Rank1(B, i) = L[i/l] + S[i/s] + P(B[i/s*s, i))
        three table lookup. O(1) time                           L[1]
   Total size of L, S, P is o(n)
                                                                   +S[4]
   Select can also be supported in O(1) time
    using n+o(n) bits                                                   +P[X]

          In practice, P(X) is often replaced with popcount(X)
Rank/Select Dictionary
Case 2. Sparse (RRR)
   Case 1. is redundant when m << n
   A small block X is represented by the class and the offset.
        The class C(X) is the number of ones in X
             e.g . X=101, C(X) = 2.
        The offset O(X) is the lexicographical rank among C(X)
             e.g. X0=011, X1=101, X2 =110, O(X0)=0, O(X1)=1, O(X2)=2
   Store a pointer of every (log n)2/2 blocks and an offset for each element

B = 111110100101010000100111010
C = 3 2 1 2 1 0 1 3 1
O = 0 2 2 1 1 0 2 2 1
Rank/Select Implementation
Case 2. Sparse (RRR)
   Ida: when m << n, many classes (# of ones in small block) are small, and
    the offsets are also small

   Total size is nH0(B) + o(n) bits. Rank is supported in O(1) time
        We can also support select in O(1) time using the same working space.


   Note: RRR is also efficient when ones and zeros are clustered.
        The following bit is not sparse, but can be efficiently stored by RRR encoding.
    000000100000100011111111110111110111111111100000000010000001


              many 0s                     many 1s                     many 0s
Rank/Select Implementation
Case 3. Very Sparse (SD array)
   Case 2. is not efficient for very sparse case.
        o(n) term cannot be negligible when nH0 is very small
   Let S[i] := select1(B, i) and t = lg (n/m).
   L[i] := the lowest t bits of S[i]
   P[i] := S[i] / 2t
   Since P is weakly increasing integers, use prefix sums to represent H
        Concatenate unary codes of P[i]-P[i-1].


     B=0001011000000001
     S[0, 3] = [3, 5, 6, 15], t = lg(16 / 4) = 2.
     L[0, 3] = [3, 1, 2, 3]
     P[i] = [0, 1, 1, 3], H = 1011001
Rank/Select Implementation
Case 3. Very Sparse (SDArray)
   L is stored explicitly, and H is stored with Rank/Select dictionary.
   Size
     H : 2m bits (H[m] ≦ m)
     L : m log2 (n/m) bits.
   select(B, i) = (select1(H, i)-i)2t + L[i]
        O(1) time
   rank(B, i) = linear search from 1 + rank1(H, select0(H, i))
        expected O(1) time.
Rank/Select for large alphabets

   Consider Rank/Select on a string with many alphabets
   I will explain only wavelet trees here since it is very very useful !

   I will not discuss other topics of rank / select for large alphabets
        Multi-array wavelet trees [Ferragina+ ACM Transaction 07]
        Permutation based approach [Golynski SODA 06]
        Alphabet partitioning [Barbey+ ISSAC 10]
Wavelet Tree

   Wavelet Tree = alphabet tree + bit vectors in each node


    T = ahcbedghcfaehcgd




a    b    c    d    e     f   g    h
Wavelet Tree

   Wavelet Tree = alphabet tree + bit vectors in each node




    ahcbedghcfaehcgd                   Filter characters according to its value
    010010110
                                       For each character c := T[i] (i = 0…n),
                                       we move c to the left child and store 0
    acbdc          hegh                if c is the descendant of the left child.
                                       We move c to the right child and store 1
                                       otherwise.




a    b    c    d    e     f   g    h
Wavelet Tree

   Wavelet Tree = alphabet tree + bit vectors in each node
   Recursively continue this process.
   Store a alphabet tree and bit vectors of each node

    ahcbedghcfaehcgd
    0100101101011010
          0             1
acbdcacd           heghfehg
01011011           10110011

aba       cdccd     efe         hghhg
010       01001     010         10110

a     b    c   d    e       f   g   h
Wavelet Tree (contd.)

   Total bit length is n lg σ bits
           The height is l2σ, and there are n bits at each depth.

            0100101101011010


        01011011          10110011                          Height is lg σ



010             01001      010       10110


    a       b    c    d     e    f    g    n
Wavelet Tree (Lookup)

   Lookup(T, pos)
                                             pos := 9
   e.g. Lookup(T, 9)

         0100101101011010                    b := B[pos] = 0
                                             pos := Rankb(B, pos) = 4
             0              1
                                             b := B[pos] = 0
        01011011        10110011
                                             pos := Rankb(B, pos) = 2


010          01001      010         10110    b := B[pos] = 0


    a    b    c    d    e       f    g   n   Return “c”
Wavelet Tree (Rank)

   Rankc (T, pos)                          ahcbedghcfaehcgd
   e.g. Rankc(T, 9)

         0100101101011010                    b := B[9] = 0
                                             pos := Rank0(B, 0) = 4
             0             1
                                             b := B[4] = 4
        01011011       10110011
                                             pos := Rank1(B, 4) = 2

                                             b := B[2] = 0
010          01001     010         10110     pos := Rank0(B, 2) = 1

    a    b    c    d   e       f    g   n    Return 1
Wavelet Tree (Select)

   Selectc (T, i)
   Example Selectc(T, 1)

         0100101101011010                   i := Select0(B,4) = 8
             0             1

        01011011       10110011             i := Select1(B,2) = 4

                                            i := Select0(B,1) = 2
010          01001     010         10110


    a    b    c    d   e       f    g   n   i=1
Wavelet Tree (Quantile List)                                Wavelet Tree is so called
                                                              double-sorted array

   Quantile(T, p, sp, ep) : Return the (p+1)-th largest value in T[sp, ep)
        E.g. Quantile(T, 5, 4, 12)

          0100101101011010                      # of zeros in B[4,12] is 3
                                                3 < (5+1) and go to right
               0               1
        01011011                                # of zeros in B[1,6] is 3
                          10110011
                                                3 > 2 and go to left

                                                # of zeros in B[0, 2] is 2
010            01001      010          10110    2 = 2 and go to right

    a      b    c    d     e       f    g   n   Return “f”
                                                     43782614
                                                 ahcbedghcfaehcgd
Wavelet Tree (top-k Frequent List )

   FreqList (T, k, sp, ep) : Return k most frequent values in T[sp, ep)
        E.g. FreqList(T, 5, 4, 12)
                                                Greedy Search on the tree.
          0100101101011010
                                                Larger ranges will be examined first.
               0               1

        01011011          10110011              Q := priority queue {# of elems, node}
                                                Q.push {[ep-sp, root]}
                                                while
                                                 q = Q.pop()
010            01001       010         10110     If q is leaf, report q and return
                                                 Q.push([0’s in range, q.left])
                                                 Q.push([1’s in range, q.right])
    a      b    c     d    e       f    g   n   end while
Wavelet Tree (RangeFreq, RangeList)

   RangeList (T, sp, ep, min, max) : Return {T[i]} s.t. min≦ [i]<max, sp≦i<ep
   RangeFreq(T, sp, ep, min, max) : Return |RangeList(T, sp, ep, min, max)|
        E.g. RangeList(T, 4, 12, c, e)

          0123457890123456                     Enumerate nodes that go to between
          0100101101011010                     min and max
               0              1
                                               For RangeList, # of enumerated
        01011011         10110011              nodes is O(log σ + occ)

                                               For RangeFreq, # of enumerated n
                                               odes is O(log σ)
010            01001      010         10110


    a      b    c    d    e       f    g   n
Wavelet Tree Summary

 Input T[0…n), 0≦T[i]<σ
 Many operations can be supported in O(log σ) time using wavelet tree

                           Description                              Time
Lookup                     T[i]                                     O(log σ)
Rankc(T, p)                the number of c in T[0…p)                O(log σ)
Selectc(T, i)              the pos of (i+1)-th c in T               O(log σ)
Quantile(T, s, e, r)       (r+1)-th largest val in T[s, e)          O(log σ)
FreqList(T, k, s, e)       k most frequent values in T[s, e)        Expected O(klog σ)
                                                                    Worst O(σ)
RangeFreq(T, s, e, x, y)   items x ≦ T[i] < y in T[s, e)            O(log σ)

RangeList(T, s, e, x, y)   List T[i] s.t. x ≦ T[i] < y in T[s, e)   O(log σ + occ)
NextValue(T, s, e, x)      smallest T[r] > x, s.t. s ≦ r ≦ e        O(log σ)
Wavelet tree improvements

   Huffman Tree shaped Wavelet Tree (HWT)
        If the alphabet tree corresponds to the Huffman tree for T, the size of the
         wavelet tree becomes nH0 + o(nH0) bits.
        Expected query time is also O(H0) when the query distribution is same as
         in T


   Wavelet Tree for large alphabets
        For large alphabet set (e.g. σ ≒, n) the overhead of alphabet tree, and bit
         vectors could large.
        Concatenate bit vectors of same depth. In traversal of tree, we iteratively
         compute the range of the original bit array.
2D Search using Wavelet Tree

        Given a 2D data set D={(xi, yi)}, find all points in [sx, sy] x [ex, ey]
        By using a wavelet tree, wen can solve this in O(log ymax + occ) time

    Remove columns with no data.
    If there are more than one data in the same column, we copy the column so that
    each column has exactly one data.

     0 1 2 3 4 5 6                      0 1 2 2 4 4 6 6 6
0                      *            0                         *           Store the x information
                                                                          using a bit array
1        *       *                  1     *           *
                                                                          Xb = 101011001100111
2                                   2
3 *          *   *                  3 *       *           *               Since each column has
4                      *            4                             *       one value, we can convert
                                                                          it to the string
5                      *            5                                 *   T =313613046
6            *                      6             *
2D Search using Wavelet Tree (contd.)

   [sx, sy] x [ex, ey] region is found by
        s’ := Select0(X, sx)+1 , e’ := Select0(X, ex)+1
        s’’ := Rank1(X, s’)       e’’ := Rank1(X, e’)
        Perform RangeFreq(s’’, e’’, sy, ey)


         0 1 2 3 4 5 6
     0                         *
     1      *         *
                                                X = 101011001100111
     2
     3 *        *     *                         T =313613046
     4                         *
     5                         *
     6          *
Full-text Index
Full-text Index

   Given a text T[0, n) and a query Q[0, m), support following operations.
   occ(T, Q) : Report the number of occurrence of Q in T
   locate(T, Q) :Report the positions of occurrence of Q in T

 Example
i 0123456789A
T abracadabra

   occ(T, “abra”)=2
locate(T, “abra”)={0, 7}
Suffix Array

   Let Si = T[i, n] be the i-th suffix of T
   Sort {Si}i=0…n by its lexicographical order
   Store sorted index in SA[0…n] in n lg n bits (SSA[i] < SSA[i+1])

 S0    abracadabra$           S11   $                     11
 S1    bracadabra$            S10   a$                    10
 S2    racadabra$             S7    abra$                 7
 S3    acadabra$              S0    abracadabra$          0
 S4    cadabra$               S3    acadabra$             3
 S5    adabra$                S5    adabra$               5
 S6    dabra$                 S8    bra$                  8
 S7    abra$                  S1    bracadabra$           1
 S8    bra$                   S4    cadabra$              4
 S9    ra$                    S6    dabra$                6
 S10   a$                     S9    ra$                   9
 S11   $                      S2    racadabra$            2
Search using SA

    Given a query Q[0…m), find the range [sp, ep) s.t.
    sp := max{i | Q ≦ SSA[i]}                                   Q=“ab”, sp=2, ep=4
    ep := min{i | Q > SSA[i]}                             S11   $
                                                          S10   a$
    Since {SSA[i]} are sorted, sp and ep
                                                          S7    abra$
     can be found by binary search in O(m log n) time     S0    abracadabra$
        Each string comparison require O(m) time.        S3    acadabra$
   The occurrence positions are SA[sp, ep)               S5    adabra$
                                                          S8    bra$
        if ep=sp then Q is not found on T
                                                          S1    bracadabra$
                                                          S4    cadabra$
   Occ(T, Q) : O(m log n) time.                          S6    dabra$
   Locate(T, Q) : O(m log n + Occ) time                  S9    ra$
                                                          S2    racadabra$
Suffix Tree
                                                                      11
                                                                        10                 $
   A trie for all suffixes of T                                      $ bra                    7
        Remove all nodes with only one child.               $
                                                                                   c       c   0
        Store suffix indexes (SA) at leaves
        Store depth information at internal nodes, and           a                    3
                                                                               d
         edge information is recovered from depth and SA .                             5
   Fact: # of internal nodes is at most N-1                      bra
                                                                                   $
                                                                  c                    8
        Proof: Consider inserting a suffix one by one.
         At each insertion, at most one internal node will                         c   1
         be build.                                                d        4

                                                             ra            6
   Many string operations can be supported by using                           $
                                                                                       9
    a suffix tree.
                                                                               c       2
Burrows Wheeler Transform (BWT)

                                      i    SA BWT Suffix
   Convert T into BWT as follows
                                      0    11 a    $
    BWT[i] := T[SA[i] – 1]
                                      1    10 r    a$
       if SA[i] = 0, BWT[i] = T[n]
                                      2    7   d   abra$

   abracdabra$->ard$rcaaaabb         3    0   $   abracadabra$
                                      4    3   r   acadabra$
                                      5    5   c   adabra$
                                      6    8   a   bra$
                                      7    1   a   bracadabra$
                                      8    4   a   cadabra$
                                      9    6   a   dabra$
                                      10   9   b   ra$
                                      11   2   b   racadabra$
Characteristics of BWT

   Reversible transformation
        We can recover T from only BWT
   Easily compressed
        Since similar suffixes tend to have preceding characters, same characters
         tend to gather in BWT. bzip2
   Implicit suffix information
        BWT has same information as suffix arrays, and can be used for full-text
         index (so called FM-index).
Example : Before BWT

When Farmer Oak smiled, the corners of his mouth spread till they were
within an unimportant distance of his ears, his eyes were reduced to chinks,
and diver gingwrinkles appeared round them, extending upon his
countenance like the rays in a rudimentary sketch of the rising sun. His
Christian name was Gabriel, and on working days he was a young man of
sound judgment, easy motions, proper dress, and general good character. On
Sundays he was a man of misty views, rather given to postponing, and
hampered by his best clothes and umbrella : upon the whole, one who felt
himself to occupy morally that vast middle space of Laodiceanneutrality which
lay between the Communion people of the parish and the drunken section, --
that is, he went to church, but yawned privately by the time the con+gegation
reached the Nicene …
Example : After BWT

ooooooioororooorooooooooorooorromrrooomooroooooooormoorooororioooro
ormmmmmuuiiiiiIiuuuuuuuiiiUiiiiiioooooooooooorooooiiiioooioiiiiiiiiiiioiiiiiieuiiiiiiii
iiiiiiiiiouuuuouuUUuuuuuuooouuiooriiiriirriiiiriiiiiiaiiiiioooooooooooooiiiouioiiiioiiu
iiuiiiiiiiiiiiiiiiiiiiiiiioiiiiioiuiiiiiiiiiiiiioiiiiiiiiiiiiioiiiiiiuiiiioiiiiiiiiiiiioiiiiiiiiiioiiiioiiiiiiioiiiaiiiiiiiiiiiiiiiii
oiiiiiioiiiiiiiiiiiiiiiuiiiiiiiiiiiiiiiiiioiiiiiiiioiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuuuiioiiiiiuiiiiiiiiiiiiiiiiiiiiiiiioiiiiuioi
uiiiiiiioiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioaoiiiiioioiiiiiiiioooiiiiiooioiiioiiiiiouiiiiiiiiiiiiooiiiiiiiiiiiiiiiiiiiii
iiiiiiiiiioiiiiiiiiiiiiiiiiiiioiooiiiiiiiiiiioiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiiiiiiiiiiiioiiiuiiiiiiiiiioiiiiiiiiiiiiuoiii
oiiioiiiiiiiiiiiiiiiiiiiiiiuiiiiuuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuiuiiiiiuuiiiiiiiiiiiiiiiiiiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiioiii
iiiioiiiiiiiiiiiiiiiiiiiiioiiiiiiiiioiiiiuiiiioiiiioiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiioiiiiiiuiiiiiiiiiiiiiiiooiiiiiiiiiiiiiiiiii
iioooiiiiiiiioiiiiouiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiioiiiiiiioiiiiiioiiiiuiuoiiiiiiiiioiiiiiiioiiiiiiiiiiii
iiiiiiiiiiiiiiiioioi….
LF-mapping

                                        i    BWT   Suffix (First char is F)
   Let F[i] = T[SA[i]]
                                        0    a1    $
        First characters of suffixes
                                        1    r     a1$
   Fact : the order of same chars
                                        2    d     a2bra$
    in BWT and F is same
                                        3    $     a3bracadabra$
                                        4    r     a4cadabra$
                                        5    c     a5dabra$
                                        6    a2    bra$
                                        7    a3    bracadabra$
                                        8    a4    cadabra$
                                        9    a5    dabra$
                                        10   b     ra$
                                        11   b     racadabra$
LF-mapping (1/3)

                                        i    BWT   Suffix (First char is F)
   Let F[i] = T[SA[i]]
                                        0    a1
        First characters of suffixes
                                        1    r     a1
   Fact : the order of same chars
                                        2    d     a2
    in BWT and F is same
                                        3    $     a3bracadabra$

   Proof: the order of chars in BWT    4    r     a4
    is determined by the following      5    c     a5
                                                                   Same
    suffixes.                           6    a2
    the order of chars in F is also     7    a3    bracadabra$
    determined by the following         8    a4
    suffixes. These are same !          9    a5
                                        10
                                        11
LF-mapping (2/3)

                                                 i    BWT   Suffix (First char is F)
   Let F[i] = T[SA[i]]
                                                 0    a     $
        First characters of sorted suffixes
                                                 1    r     a$
                                                 2    d     abra$
   Let C[c] be the number of
                                                 3    $     abracadabra$
    characters smaller than c in T
        C[c] = beginning position of F[i] = c   4    r     acadabra$
   For B[i]=c, its corresponding char           5    c     adabra$
    in F[i] is found by Rankc(T, i) + C[c]       6    a     bra$
                                                 7    a     bracadabra$
                                                 8    a     cadabra$
                                                 9    a     dabra$
                                                 10   b     ra$
                                                 11   b     racadabra$
LF-mapping (3/3)

   Let ISA[0, n) s.t. ISA[r]=p if SA[p] = r
        SA[i] converts the rank into the position, and ISA[i] vice versa.
   For c := BWT[r], we define two functions
    LF(r) := Rankc(T, r) + C[c] = ISA[SA[r] -1]
    PSI(r) := Selectc(T, r– C[c]) = ISA[SA[r] + 1]
        LF and PSI are inverse functions each other.
   Using LF and PSI, we can move a text in its position order.
        PSI = forward by one char , LF = backward by one char
   e.g.    T = abracadabra


            SA[LF[r]]   SA[r]   SA[PSI[r]]
Decoding of BWT using LF-mapping

r := beg // the rank of T[0, n), i.e. ISA[0]
while
   Output BWT[r]
   r := PSI(r)
until r != beg

That’s all !

In practice, PSI values are pre-computed using O(n) time
Compressed Full-text Index
Compressed Full-Text Index

   Full-text can be compressed
        Moreover, we can recover original text from compressed index
         (self-index)


   I explain FM-index only here.

   Other important compressed full-text indices
        Compressed Suffix Arrays
        LZ-index
        LZ77-based
FM-index                           P’=ra appear in P            P=a appeared in [1,
                                                                       5)
                                                i      BWT   Suffix
   Using LF-mapping for searching
                                                0      a     $
    pattern.
                                                1      r     a$
                                                2      d     abra$
   Assume that a range [sp, ep)
                                                3      $     abracadabra$
    correspond to a pattern P.
                                                4      r     acadabra$
   Then if a pattern P’ = cP occurs,           5      c     adabra$
    then [sp, ep) should be preceded by c       6      a     bra$
                                                7      a     bracadabra$
                                                8      a     cadabra$
                                                9      a     dabra$
                                                10     b     ra$
                                                11     b     racadabra$
FM-index                           P’=ra appear in P            P=a appeared in [1,
                                                                       5)
                                                i      BWT   Suffix
   Using LF-mapping for searching
                                                0      a     $
    pattern.
                                                1      r1    a$
                                                2      d     abra$
   Assume that a range [sp, ep)
                                                3      $     abracadabra$
    correspond to a pattern P.
                                                4      r2    acadabra$
   Then if a pattern P’ = cP occurs,           5      c     adabra$
    then [sp, ep) is preceded by c              6      a     bra$
                                                7      a     bracadabra$
   The corresponding suffixes can be           8      a     cadabra$
    found by Rankc(BWT, sp) + C[c]              9      a     dabra$
    and Rankc(BWT, ep) + C[c]                   10     b     r1a$
                                                11     b     r2acadabra$
FM-index

Input query : Q[0, m)
Output a range [sp, ep) corresponding to Q

sp := 0, ep := n
for i = m-1 to 0
  c := Q[c]
  sp := C[c] + Rankc(BWT, sp)
  ep := C[c] + Rankc(BWT, ep)
end for
Return (sp, ep)

Occ(T, Q) = ep - sp
Locate(T, Q) = SA[sp, ep)
Storing Sampled SA’s
                                                                 i    SA   R
   Store sampled SAs instead of original SA.                    0         0
   Sample SA[p] s.t. SA[p] = tk for (t=0…)                      1         0
   We also store a bit array R to indicate the index of         2         0
    sampled SAs.                                                 3    0    1
                                                                 4         0
   Recall that LF(r) = ISA[SA[r]-1] for c = BWT[c].             5         0
   If SA[p] is not sampled, we use LF to move on SA             6    8    1
        SA[r] = SA[LF(r)]-1 = SA[LF2(r)]-2 = … = SA[LFk(r)]-k   7         0
   We move on SA until the sampled point is found               8    4    1
    using at most k LF operations.                               9         0
                                                                 10        0
   Set k = log1+εn. The size of A is (n/k) log n = o(n)         11        0
   The lookup of SA[i] requires O(log1+εn logσ) time.
Grammar Compression on SA [R. Gonzalez+ CPM07]

   SA can also be compressed using grammar based compressions
   Let DSA[i] := SA[i] – SA[i-1] (for i > 0)
   If BWT[i] = BWT[i+1] = … = BWT[i+run] then
    DSA[i+1, i+run] = DSA[k+1, k+run] for k := LF[i]
       DSA contains many repetition if BWT has many runs.


   DSA can be compressed using grammar compression
       [R. Gonzelez+ CPM07] used Re-pair but other grammar compressions
        can be used as long as they support fast decompression.
Compression Boosting in FM-index

   The size of FM-index using wavelet trees is nH0 + o(nH0) bits
   We can improve the space to nHK by compression boosting
   Compression Boosting [P. Ferragina+ ACM T 07]
        Divide BWT into blocks using contexts of length k
        Build a separate wavelet tree for each block
   Implicit Compression Boosting [Makinen+ SPIRE 07]
        Wavelet Tree + compressed bit vectors using RRR in Wavelet Tree
         The size is close to optimal: the overhead is at most σk+1 log n, which is
         o(n) when k ≦ ((1-ε)logσn)-1
                                                        b   aaaa..
                                                        b   aabb..           Divide BWT’s
                                                        a   aacc..           using following
                                                        b   aacd..           suffix of length
                                                        d   abcd..
                                                        d   abcc..
Compression Boosting using fixed length block
[Karkkaine+ SPIRE 11]
   Dividing BWT into blocks of fixed length can also achieve nHK
        Block size is σ(log n)2
        To analyze the working space, we calculate the overhead incurred of
         conversion from context-sensitive partition into fixed-length partition


   Note. This encoding is very similar to the encoding used in bzip2
SSA: Huffman WT
AFFMI : Context Sensitive
        Partition of BWT
Fixed Block : Fixed Block

+ RRR : Represent bit vectors
in WT using RRR encoding
Summary of state-of-the-art

   Give an input String T, compute BWT for T (=B).
   Divide B into blocks of fixed length
   Each block is stored in a Huffman-shaped wavelet tree.
   A bit-vector in a wavelet tree is stored using RRR representation.
   For English, 2-3 bits per character, 1 pattern 0.06 msec.
Analyzing Document Collection
Document Query

   DL(D, P) : Document Listing
        List the distinct documents where P appears
   DLF(D, P) :Document Listing with Frequency
        The result of DL(D, P) + frequency of P in each document
   Top(k, D, P) : Top-k Retrieval
        List the k documents where P appears with largest scores S(D, P)
        e.g. S(D, P) : term frequency, document’s score (e.g. Page rank)


   In the previous section we consider hit positions, and here we consider hit
    documents
        We need to convert hit positions into hit documents.
        This is not easy.
Index for document collection

   For a document collection D={d1, d2, …, dD}
   Build T = d1#d2#d3#...#dD and build an full-text index for T
        We also use a bit array to indicate the beginning positions of docs.
        Given a hit position in T, we can convert it to the hit doc in O(1) time.


   Baseline of DL and DLF
        Perform Locate(T, Q) and convert each hit positions into a hit doc each
         independently.
        However, this requires O(occ(T, Q)) time, and occ(T, Q) could be much larger
         than # of hit documents
DocArray and PrevArray (1/2)
[Muthukrishnan SODA 02]

   DocArray : D[0, n] s.t. D[i] is the doc ID of SA[i].
        DL(D, Q) is now to list all distinct values in the hit range D[sp, ep)
   PrevArray : P[0, n] s.t. P[i]=max{j < I, D[j]=D[i]}
        P[i] can be simulated as selectD[i] (D, rankD[i](D, i-1))



                                           i      0     1      2       3    4       5       6
                                           D       66    77     66     88   77      77      88
                                           P       -1    -1        0   -1       1       4       3
DocArray and PrevArray (2/2)

   Alg. DocList reports all distinct doc IDs in [sp, ep) in O(|DL|) time.
        Call DocList(sp, sp, ep)


DocList(gsp, sp, ep)                          i     0      1        2       3    4       5       6
 IF sp=ep                                     D       66       77   66      88    77     88      88
    RETURN;
 (min, pos) := RMQ(P, sp, ep)                 P       -1       -1       0   -1       1       3       5
 IF min ≧ gsp
                                                                    RMQ(P, 2, 7) = -1 (=P[3])
    RETURN;                                                         Report D[3]
 Report D[pos]
 DocList(gsp, sp, pos)
 DocList(gsp, pos+1, ep)

                                    RMQ(P, 2, 3) = 0 (=P[2])
                                                                            RMQ(P, 5, 7) = 3 (=P[5])
                                    Report D[2]
                                                                            Return
Wavelet Tree + Grammar Compression
[Navarro+ SEA 11]

   Build a wavelet tree for D.
   Using Freqlist(k, sp, ep), we can solve DL and DLF in O(log σ) time.
   However storing D requires n lg |D| bits. This is very large.

   As in compression of suffix arrays using grammar compression, D can
    also be compressed by grammar compression
   Apply a grammar compression to bit arrays in wavelet trees, and support
    rank/select operations

   Currently, this is the fastest/smallest approach in practice.
Suffix Tree based approach (1/4)
[W.K. Hon+ FOCS 09]
   Let ST be the generalized suffix tree for a collection of documents
   A leaf l is marked with document d if l corresponds to d
   An internal node l is marked with d if at least two children of v contain
    leaves marked with d
        An internal node can be marked with many values d.


                                   1    c

                      2   a   b    c


          3           7   a        10   c


    4     5   6   8       9   11   12       13   14
    a     c   b   a       a   c    c        b    c
Suffix Tree based approach (2/4)

      In every node v marked with d, we store a pointer ptr(v, d) to its lowest
       ancestor u such that u is also marked with d
           ptr(v, d) also has a weight corresponds to relevance score.
           If such ancestor does not exist, ptr(v, d) points to a super root node.



                                         1    c
ptr(7, a) = 2
                            2   a   b    c
                                                           ptr(13, a) = 2

             3              7   a        10   c


       4        5   6   8       9   11   12       13   14
       a        c   b   a       a   c    c        b    c
Suffix Tree Based Approach (3/4)

   Lemma 1. The total number of pointers ptr(*,*) in T is at most 2n
   Lemma 2. Assume that document d contains a pattern P and v is the node
    corresponding to P. Then there exist a unique pointer ptr(u, d), such that u
    is in the subtree of v and ptr(u, d) points to an ancestor of v



                                1    c
                                              Example: when v = 10. ptr(10, c)=2,
                   2   a   b    c             and ptr(13, b)=2.
                                              These ptrs correspond to
                                     c        the occurrence of P in b and c.
                                10


                           11   12       13
                           c    c        b
Suffix Tree Based Approach (4/4)

   To enumerate document corresponding to the node x, find all ptrs that
    begins in the subtree of x, and points to the ancestor of x.
        These ptrs uniquely corresponds to each hit documents (Lemma)




                    x
Geometric Interpretation
                       pd

pd(v, c) :=        -1           c       a    b
depth of ptr(v, c) 0                                  c                                                     c
                       1                                       a       c   b    a           c           b
place all ptrs in      2                                                            a   a       c   c
a point (x, pd(v,c))
                                1       2    2        2        4       5   6   7    8   9   10 11 12 13 14

                                                                               depth
                                                 1    c                        0

                           2    a       b        c                             1


          3                 7       a        10       c                        2


     4    5    6       8        9       11       12       13       14          3

     a    c    b       a        a       c        c        b        c
Document Query = three-sided 2D query

   When v := locus(Q) d := depth(v), and r be the maximal index assigned to
    v or its descendants. DL can be solve by
    RangeList(v, r, 0, d-1)




       pd
      -1                                v=2
            c   a   b                                     v=10
       0                c                                             c
       1                    a   c   b   a             c           b
       2                                      a   a       c   c
            1   2   2   2   4   5   6   7     8   9   10 11 12 13 14
Top-K query

   Each ptr(v, d) can store any score function depended on Q, D.
   For each node v, f ptr(x, d) points to vTop-K

   Theorem [Navarro+ SODA 12] k most highly weighted points in the range
    [a, b] x [0, h] can be reported in decreasing order of their weights in O(h+k)
    time

   In practice, we can use a priority queue on each node.
Bibliography

[Navarro+ SEA 12] Practical Compressed Document Retrieval, G. Navarro, S.
J. Puglisi, and D. Valenzuela, SEA 2012
[Navarro+ SODA 12] Top-k Document Retrieval in Optimal Time and Linear
Space, SODA 2012
[T. Gagie+ TCS+ 11] New Algorithms on Wavelet Trees and Applications to
Information Retrieval, TCS 2011
[G. Navarro+ 11] Practical Top-K Document Retrieval in Reduced Space
[Karkkaine+ SPIRE 11] Fixed Block Compression Boosting in FM-indexes, J.
Karkkainen, S. J. Puglisi, SPIRE 2011
[W.K. Hon+ FOCS 09] Space-Efficient Framework for Top-k String Retrieval
Problems, W. K. Hon, R. Shah, J. S. Vitter, FOCS 2009
[Navarro+ SEA 11] “Practical Compressed Document Retrieval”, Gonzalo
Navarro , Simon J. Puglisi, and Daniel Valenzuela,, SEA 2011
[Navarro+ SODA 12] Top-k Document Retrieval in Optimal Time and Linear
Space , Gonzalo Navarro, Yakov Nekrich, SODA 2012
[Gagie+ JTCS 11] New Algorithms on Wavelet Trees and Applications to
Information Retrieval, Travis Gagie , Gonzalo Navarro , Simon J. Puglisi, J. of
TCS 2011

Contenu connexe

Tendances

大規模グラフアルゴリズムの最先端
大規模グラフアルゴリズムの最先端大規模グラフアルゴリズムの最先端
大規模グラフアルゴリズムの最先端
Takuya Akiba
 

Tendances (20)

ゼロから始める自然言語処理 【FIT2016チュートリアル】
ゼロから始める自然言語処理 【FIT2016チュートリアル】ゼロから始める自然言語処理 【FIT2016チュートリアル】
ゼロから始める自然言語処理 【FIT2016チュートリアル】
 
オントロジーとは?
オントロジーとは?オントロジーとは?
オントロジーとは?
 
リクルート式 自然言語処理技術の適応事例紹介
リクルート式 自然言語処理技術の適応事例紹介リクルート式 自然言語処理技術の適応事例紹介
リクルート式 自然言語処理技術の適応事例紹介
 
深層学習による自然言語処理の研究動向
深層学習による自然言語処理の研究動向深層学習による自然言語処理の研究動向
深層学習による自然言語処理の研究動向
 
強化学習その2
強化学習その2強化学習その2
強化学習その2
 
東北大学講義資料 実世界における自然言語処理 - すべての人にロボットを - 坪井祐太 
東北大学講義資料 実世界における自然言語処理 - すべての人にロボットを - 坪井祐太 東北大学講義資料 実世界における自然言語処理 - すべての人にロボットを - 坪井祐太 
東北大学講義資料 実世界における自然言語処理 - すべての人にロボットを - 坪井祐太 
 
さらば!データサイエンティスト
さらば!データサイエンティストさらば!データサイエンティスト
さらば!データサイエンティスト
 
大規模グラフアルゴリズムの最先端
大規模グラフアルゴリズムの最先端大規模グラフアルゴリズムの最先端
大規模グラフアルゴリズムの最先端
 
データサイエンティストのつくり方
データサイエンティストのつくり方データサイエンティストのつくり方
データサイエンティストのつくり方
 
Encoder-decoder 翻訳 (TISハンズオン資料)
Encoder-decoder 翻訳 (TISハンズオン資料)Encoder-decoder 翻訳 (TISハンズオン資料)
Encoder-decoder 翻訳 (TISハンズオン資料)
 
10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)
 
機械学習を民主化する取り組み
機械学習を民主化する取り組み機械学習を民主化する取り組み
機械学習を民主化する取り組み
 
ゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learning
ゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learningゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learning
ゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learning
 
MS COCO Dataset Introduction
MS COCO Dataset IntroductionMS COCO Dataset Introduction
MS COCO Dataset Introduction
 
[DL輪読会]FOTS: Fast Oriented Text Spotting with a Unified Network
[DL輪読会]FOTS: Fast Oriented Text Spotting with a Unified Network[DL輪読会]FOTS: Fast Oriented Text Spotting with a Unified Network
[DL輪読会]FOTS: Fast Oriented Text Spotting with a Unified Network
 
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
音源分離 ~DNN音源分離の基礎から最新技術まで~ Tokyo bishbash #3
 
人間の意思決定を機械学習でモデル化できるか
人間の意思決定を機械学習でモデル化できるか人間の意思決定を機械学習でモデル化できるか
人間の意思決定を機械学習でモデル化できるか
 
katagaitai CTF勉強会 #5 Crypto
katagaitai CTF勉強会 #5 Cryptokatagaitai CTF勉強会 #5 Crypto
katagaitai CTF勉強会 #5 Crypto
 
[DL輪読会]DIVERSE TRAJECTORY FORECASTING WITH DETERMINANTAL POINT PROCESSES
[DL輪読会]DIVERSE TRAJECTORY FORECASTING WITH DETERMINANTAL POINT PROCESSES[DL輪読会]DIVERSE TRAJECTORY FORECASTING WITH DETERMINANTAL POINT PROCESSES
[DL輪読会]DIVERSE TRAJECTORY FORECASTING WITH DETERMINANTAL POINT PROCESSES
 
高速な倍精度指数関数expの実装
高速な倍精度指数関数expの実装高速な倍精度指数関数expの実装
高速な倍精度指数関数expの実装
 

Similaire à Succinct Data Structure for Analyzing Document Collection

Graph Algorithms Graph Algorithms Graph Algorithms
Graph Algorithms Graph Algorithms Graph AlgorithmsGraph Algorithms Graph Algorithms Graph Algorithms
Graph Algorithms Graph Algorithms Graph Algorithms
htttuneti
 
CS330-Lectures Statistics And Probability
CS330-Lectures Statistics And ProbabilityCS330-Lectures Statistics And Probability
CS330-Lectures Statistics And Probability
bryan111472
 
Applicationof datastructures
Applicationof datastructuresApplicationof datastructures
Applicationof datastructures
Hitesh Wagle
 
Applicationof datastructures
Applicationof datastructuresApplicationof datastructures
Applicationof datastructures
Hitesh Wagle
 
Tree distance algorithm
Tree distance algorithmTree distance algorithm
Tree distance algorithm
Trector Rancor
 

Similaire à Succinct Data Structure for Analyzing Document Collection (20)

Existence of Solutions of Fractional Neutral Integrodifferential Equations wi...
Existence of Solutions of Fractional Neutral Integrodifferential Equations wi...Existence of Solutions of Fractional Neutral Integrodifferential Equations wi...
Existence of Solutions of Fractional Neutral Integrodifferential Equations wi...
 
U0 vqmtq3mja=
U0 vqmtq3mja=U0 vqmtq3mja=
U0 vqmtq3mja=
 
Ch9b
Ch9bCh9b
Ch9b
 
Graph Algorithms Graph Algorithms Graph Algorithms
Graph Algorithms Graph Algorithms Graph AlgorithmsGraph Algorithms Graph Algorithms Graph Algorithms
Graph Algorithms Graph Algorithms Graph Algorithms
 
D021018022
D021018022D021018022
D021018022
 
Unit 3
Unit 3Unit 3
Unit 3
 
Unit 3
Unit 3Unit 3
Unit 3
 
CS330-Lectures Statistics And Probability
CS330-Lectures Statistics And ProbabilityCS330-Lectures Statistics And Probability
CS330-Lectures Statistics And Probability
 
UNIT-1_CSA.pdf
UNIT-1_CSA.pdfUNIT-1_CSA.pdf
UNIT-1_CSA.pdf
 
Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...
Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...
Lucio Floretta - TensorFlow and Deep Learning without a PhD - Codemotion Mila...
 
5th Semester Electronic and Communication Engineering (2013-June) Question Pa...
5th Semester Electronic and Communication Engineering (2013-June) Question Pa...5th Semester Electronic and Communication Engineering (2013-June) Question Pa...
5th Semester Electronic and Communication Engineering (2013-June) Question Pa...
 
2013-June: 5th Semester E & C Question Papers
2013-June: 5th Semester E & C Question Papers2013-June: 5th Semester E & C Question Papers
2013-June: 5th Semester E & C Question Papers
 
smtlecture.5
smtlecture.5smtlecture.5
smtlecture.5
 
Signals Processing Homework Help
Signals Processing Homework HelpSignals Processing Homework Help
Signals Processing Homework Help
 
Solved problems
Solved problemsSolved problems
Solved problems
 
Longest Common Subsequence
Longest Common SubsequenceLongest Common Subsequence
Longest Common Subsequence
 
On the Fixed Point Extension Results in the Differential Systems of Ordinary ...
On the Fixed Point Extension Results in the Differential Systems of Ordinary ...On the Fixed Point Extension Results in the Differential Systems of Ordinary ...
On the Fixed Point Extension Results in the Differential Systems of Ordinary ...
 
Applicationof datastructures
Applicationof datastructuresApplicationof datastructures
Applicationof datastructures
 
Applicationof datastructures
Applicationof datastructuresApplicationof datastructures
Applicationof datastructures
 
Tree distance algorithm
Tree distance algorithmTree distance algorithm
Tree distance algorithm
 

Plus de Preferred Networks

Plus de Preferred Networks (20)

PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57
PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57
PodSecurityPolicy からGatekeeper に移行しました / Kubernetes Meetup Tokyo #57
 
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
 
Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...
Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...
Kubernetes + containerd で cgroup v2 に移行したら "failed to create fsnotify watcher...
 
深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...
深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...
深層学習の新しい応用と、 それを支える計算機の進化 - Preferred Networks CEO 西川徹 (SEMICON Japan 2022 Ke...
 
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
Kubernetes ControllerをScale-Outさせる方法 / Kubernetes Meetup Tokyo #55
 
Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2
Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2
Kaggle Happywhaleコンペ優勝解法でのOptuna使用事例 - 2022/12/10 Optuna Meetup #2
 
最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2
最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2
最新リリース:Optuna V3の全て - 2022/12/10 Optuna Meetup #2
 
Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2
Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2
Optuna Dashboardの紹介と設計解説 - 2022/12/10 Optuna Meetup #2
 
スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演
スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演
スタートアップが提案する2030年の材料開発 - 2022/11/11 QPARC講演
 
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
 
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
 
自然言語処理を 役立てるのはなぜ難しいのか(2022/10/25東大大学院「自然言語処理応用」)
自然言語処理を 役立てるのはなぜ難しいのか(2022/10/25東大大学院「自然言語処理応用」)自然言語処理を 役立てるのはなぜ難しいのか(2022/10/25東大大学院「自然言語処理応用」)
自然言語処理を 役立てるのはなぜ難しいのか(2022/10/25東大大学院「自然言語処理応用」)
 
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語るKubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
Kubernetes にこれから入るかもしれない注目機能!(2022年11月版) / TechFeed Experts Night #7 〜 コンテナ技術を語る
 
Matlantis™のニューラルネットワークポテンシャルPFPの適用範囲拡張
Matlantis™のニューラルネットワークポテンシャルPFPの適用範囲拡張Matlantis™のニューラルネットワークポテンシャルPFPの適用範囲拡張
Matlantis™のニューラルネットワークポテンシャルPFPの適用範囲拡張
 
PFNのオンプレ計算機クラスタの取り組み_第55回情報科学若手の会
PFNのオンプレ計算機クラスタの取り組み_第55回情報科学若手の会PFNのオンプレ計算機クラスタの取り組み_第55回情報科学若手の会
PFNのオンプレ計算機クラスタの取り組み_第55回情報科学若手の会
 
続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2
続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2
続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2
 
Kubernetes Service Account As Multi-Cloud Identity / Cloud Native Security Co...
Kubernetes Service Account As Multi-Cloud Identity / Cloud Native Security Co...Kubernetes Service Account As Multi-Cloud Identity / Cloud Native Security Co...
Kubernetes Service Account As Multi-Cloud Identity / Cloud Native Security Co...
 
KubeCon + CloudNativeCon Europe 2022 Recap / Kubernetes Meetup Tokyo #51 / #k...
KubeCon + CloudNativeCon Europe 2022 Recap / Kubernetes Meetup Tokyo #51 / #k...KubeCon + CloudNativeCon Europe 2022 Recap / Kubernetes Meetup Tokyo #51 / #k...
KubeCon + CloudNativeCon Europe 2022 Recap / Kubernetes Meetup Tokyo #51 / #k...
 
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
KubeCon + CloudNativeCon Europe 2022 Recap - Batch/HPCの潮流とScheduler拡張事例 / Kub...
 
独断と偏見で選んだ Kubernetes 1.24 の注目機能と今後! / Kubernetes Meetup Tokyo 50
独断と偏見で選んだ Kubernetes 1.24 の注目機能と今後! / Kubernetes Meetup Tokyo 50独断と偏見で選んだ Kubernetes 1.24 の注目機能と今後! / Kubernetes Meetup Tokyo 50
独断と偏見で選んだ Kubernetes 1.24 の注目機能と今後! / Kubernetes Meetup Tokyo 50
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Succinct Data Structure for Analyzing Document Collection

  • 1. 2011/12/1 ALSIP2011@Takamatsu Succinct Data Structure for Analyzing Document Collection Preferred Infrastructure Inc. Daisuke Okanohara hillbig@preferred.jp
  • 2. Agenda  Succinct Data Structures (20 min.) Succinct data Full-text  Rank/Select, Wavelet Tree Structures Index  Full-text Index (30 min.)  Suffix Array/Tree, BWT Compressed Full- text Index  Compressed Full-Text Index (20 min.)  FM-index Analyzing Document  Analyzing Document Collection (20 min.) Collection  Document Statistics (for mining)  Top-k query
  • 3. Background  Very large document collections appear everywhere  Twitter : 200 billion tweets per year. (6K tweets per sec.)  Modern DNA Sequencing  E.g. Illumina HiSeq 2000 produces 40G bases per day  Analyzing these large document collections require new data structures and algorithms
  • 4. Notation  T[0, n] : string T[0]T[1]…T[n]  T[i, j] : subarray from T[i] to T[j] inclusively  T[i, j) : subarray from T[i] to T[j-1]  lg m : log2 m (# of bits to represent m cases)  Σ : the alphabet set of string  σ := |Σ| : the number of alphabets  n : length of input string  m : length of query string
  • 5. Preliminaries Empirical Entropy (1/2)  Empirical entropy indicates the compressibility of the string  Zero-order empirical entropy of T H0(T) := ∑c n(c)/n log (n/p(c))  n(c) : the number of occurrences of c in T  n : the length of T  The lower-bound for the average code length of compressor not using context information T = abccbaabcab, n(a)=4, n(b)=4, n(c)=3, H0(T) = 1.57 bit
  • 6. Empirical Entropy (2/2)  k-th order empirical entropy of T  Hk(T) = ∑w∈Σk-1 nw/n H0(Tw) :  Tw : the characters followed by a substring w  nw : the number of occurrences of a string w in T.  The lower-bound for the average code length of compressor using the following substring of length k-1 as a context information  T = abccbaabcab Ta=bac Tb=acaa Tc=bcb, H2(T) = 0.33 bit  Hierarchy of entropy: log σ ≧ H0(T) ≧ H1(T) ≧ … Hk(T)  Typical English Text, lg σ=8 H0=4.5 , H5=1.7
  • 8. Rank/Select Dictionary For a bit vector B[0, n), B[i] ∈{0, 1}, consider following operations  Lookupc(B, i) : B[i]  Rankc(B, i) : The number of c’s in B[0…i)  Selectc(B, i) : The (i+1)-th occurrence of c in B Example: i 0123456789 B 0111011001 Rank1(B,6)=4 Select0(B,2)=2 The number of 1’s in B[0, 6) The (2+1)-th occurrence of 0 in B
  • 9. Rank/Select Dictionary Usage (1/2) Subset  For U = {0, 1, 2, …, n-1}, a subset of U, V⊆U can be represented by a bit vector B[0, n-1], s.t. B[i] = 1 if i∈V, and B[i]=0 otherwise.  Select(B, i) is the (i+1)-th smallest element in V,  Rank(B, i) = |{e|e∈V, e<i}| V = {0, 4, 7, 10, 15} i 0123456789012345 B = 1000100100100001
  • 10. Rank/Select Dictionary Usage (2/2) Partial Sums  For a non-negative integer array P[0, m), we concatenate each unary codes of P[i] to build a bit array B[0, n)  An unary code of a non-negative integer p≧0 is 0p1  E.g. unary codes of 0, 3, 10 are 1, 0001, 00000000001  From the definition, n=m+∑p[i] and the number of ones is m  P[i] = Select1(B, i) - Select1(B, i-1).  Cum[k] = ∑i=0…kP[i]. = Select1(B, k) - k (# of 0’s before (k+1)-th 1)  Find r s.t. Cum[r] ≦ x < Cum[r+1] by Rank1(B, Select0(B, x)) P = {3, 5, 2, 0, 4, 7} B = 000100000100110000100000001 Cum[4] = Select(B, 4) – 4 = 14
  • 11. Data structures for Rank/Select  Bit vector B[0, n) with m ones can be represented in lg (n, m) bits  lg (n, m) ≒ m lg2e + m lg (n/m) when m << n  The implementations of Rank/Select Dictionary are divided into the case when the bit vector is dense, sparse, and very sparse.  Case 1. Dense 01101011010100110101100101  Case 2. Sparse (RRR) 00001001000010001000100010  Case 3. Very Sparse (SDArray) 00000000000010000000001000
  • 12. Rank/Select Dictionary Case 1. Dense  Divide B into large blocks of length l=(log n)2  Divide each large block into small blocks of length s=(log n)/2  For each large block, store L[i]=Rank1(B, il)  For each small block, store S[i]=Rank1(B, is) - Rank1(B, is / l)  Build a table P(X) that stores the number of ones in X of length s  Rank1(B, i) = L[i/l] + S[i/s] + P(B[i/s*s, i))  three table lookup. O(1) time L[1]  Total size of L, S, P is o(n) +S[4]  Select can also be supported in O(1) time using n+o(n) bits +P[X] In practice, P(X) is often replaced with popcount(X)
  • 13. Rank/Select Dictionary Case 2. Sparse (RRR)  Case 1. is redundant when m << n  A small block X is represented by the class and the offset.  The class C(X) is the number of ones in X  e.g . X=101, C(X) = 2.  The offset O(X) is the lexicographical rank among C(X)  e.g. X0=011, X1=101, X2 =110, O(X0)=0, O(X1)=1, O(X2)=2  Store a pointer of every (log n)2/2 blocks and an offset for each element B = 111110100101010000100111010 C = 3 2 1 2 1 0 1 3 1 O = 0 2 2 1 1 0 2 2 1
  • 14. Rank/Select Implementation Case 2. Sparse (RRR)  Ida: when m << n, many classes (# of ones in small block) are small, and the offsets are also small  Total size is nH0(B) + o(n) bits. Rank is supported in O(1) time  We can also support select in O(1) time using the same working space.  Note: RRR is also efficient when ones and zeros are clustered.  The following bit is not sparse, but can be efficiently stored by RRR encoding. 000000100000100011111111110111110111111111100000000010000001 many 0s many 1s many 0s
  • 15. Rank/Select Implementation Case 3. Very Sparse (SD array)  Case 2. is not efficient for very sparse case.  o(n) term cannot be negligible when nH0 is very small  Let S[i] := select1(B, i) and t = lg (n/m).  L[i] := the lowest t bits of S[i]  P[i] := S[i] / 2t  Since P is weakly increasing integers, use prefix sums to represent H  Concatenate unary codes of P[i]-P[i-1]. B=0001011000000001 S[0, 3] = [3, 5, 6, 15], t = lg(16 / 4) = 2. L[0, 3] = [3, 1, 2, 3] P[i] = [0, 1, 1, 3], H = 1011001
  • 16. Rank/Select Implementation Case 3. Very Sparse (SDArray)  L is stored explicitly, and H is stored with Rank/Select dictionary.  Size H : 2m bits (H[m] ≦ m) L : m log2 (n/m) bits.  select(B, i) = (select1(H, i)-i)2t + L[i]  O(1) time  rank(B, i) = linear search from 1 + rank1(H, select0(H, i))  expected O(1) time.
  • 17. Rank/Select for large alphabets  Consider Rank/Select on a string with many alphabets  I will explain only wavelet trees here since it is very very useful !  I will not discuss other topics of rank / select for large alphabets  Multi-array wavelet trees [Ferragina+ ACM Transaction 07]  Permutation based approach [Golynski SODA 06]  Alphabet partitioning [Barbey+ ISSAC 10]
  • 18. Wavelet Tree  Wavelet Tree = alphabet tree + bit vectors in each node T = ahcbedghcfaehcgd a b c d e f g h
  • 19. Wavelet Tree  Wavelet Tree = alphabet tree + bit vectors in each node ahcbedghcfaehcgd Filter characters according to its value 010010110 For each character c := T[i] (i = 0…n), we move c to the left child and store 0 acbdc hegh if c is the descendant of the left child. We move c to the right child and store 1 otherwise. a b c d e f g h
  • 20. Wavelet Tree  Wavelet Tree = alphabet tree + bit vectors in each node  Recursively continue this process.  Store a alphabet tree and bit vectors of each node ahcbedghcfaehcgd 0100101101011010 0 1 acbdcacd heghfehg 01011011 10110011 aba cdccd efe hghhg 010 01001 010 10110 a b c d e f g h
  • 21. Wavelet Tree (contd.)  Total bit length is n lg σ bits  The height is l2σ, and there are n bits at each depth. 0100101101011010 01011011 10110011 Height is lg σ 010 01001 010 10110 a b c d e f g n
  • 22. Wavelet Tree (Lookup)  Lookup(T, pos) pos := 9  e.g. Lookup(T, 9) 0100101101011010 b := B[pos] = 0 pos := Rankb(B, pos) = 4 0 1 b := B[pos] = 0 01011011 10110011 pos := Rankb(B, pos) = 2 010 01001 010 10110 b := B[pos] = 0 a b c d e f g n Return “c”
  • 23. Wavelet Tree (Rank)  Rankc (T, pos) ahcbedghcfaehcgd  e.g. Rankc(T, 9) 0100101101011010 b := B[9] = 0 pos := Rank0(B, 0) = 4 0 1 b := B[4] = 4 01011011 10110011 pos := Rank1(B, 4) = 2 b := B[2] = 0 010 01001 010 10110 pos := Rank0(B, 2) = 1 a b c d e f g n Return 1
  • 24. Wavelet Tree (Select)  Selectc (T, i)  Example Selectc(T, 1) 0100101101011010 i := Select0(B,4) = 8 0 1 01011011 10110011 i := Select1(B,2) = 4 i := Select0(B,1) = 2 010 01001 010 10110 a b c d e f g n i=1
  • 25. Wavelet Tree (Quantile List) Wavelet Tree is so called double-sorted array  Quantile(T, p, sp, ep) : Return the (p+1)-th largest value in T[sp, ep) E.g. Quantile(T, 5, 4, 12) 0100101101011010 # of zeros in B[4,12] is 3 3 < (5+1) and go to right 0 1 01011011 # of zeros in B[1,6] is 3 10110011 3 > 2 and go to left # of zeros in B[0, 2] is 2 010 01001 010 10110 2 = 2 and go to right a b c d e f g n Return “f” 43782614 ahcbedghcfaehcgd
  • 26. Wavelet Tree (top-k Frequent List )  FreqList (T, k, sp, ep) : Return k most frequent values in T[sp, ep) E.g. FreqList(T, 5, 4, 12) Greedy Search on the tree. 0100101101011010 Larger ranges will be examined first. 0 1 01011011 10110011 Q := priority queue {# of elems, node} Q.push {[ep-sp, root]} while q = Q.pop() 010 01001 010 10110 If q is leaf, report q and return Q.push([0’s in range, q.left]) Q.push([1’s in range, q.right]) a b c d e f g n end while
  • 27. Wavelet Tree (RangeFreq, RangeList)  RangeList (T, sp, ep, min, max) : Return {T[i]} s.t. min≦ [i]<max, sp≦i<ep  RangeFreq(T, sp, ep, min, max) : Return |RangeList(T, sp, ep, min, max)| E.g. RangeList(T, 4, 12, c, e) 0123457890123456 Enumerate nodes that go to between 0100101101011010 min and max 0 1 For RangeList, # of enumerated 01011011 10110011 nodes is O(log σ + occ) For RangeFreq, # of enumerated n odes is O(log σ) 010 01001 010 10110 a b c d e f g n
  • 28. Wavelet Tree Summary Input T[0…n), 0≦T[i]<σ Many operations can be supported in O(log σ) time using wavelet tree Description Time Lookup T[i] O(log σ) Rankc(T, p) the number of c in T[0…p) O(log σ) Selectc(T, i) the pos of (i+1)-th c in T O(log σ) Quantile(T, s, e, r) (r+1)-th largest val in T[s, e) O(log σ) FreqList(T, k, s, e) k most frequent values in T[s, e) Expected O(klog σ) Worst O(σ) RangeFreq(T, s, e, x, y) items x ≦ T[i] < y in T[s, e) O(log σ) RangeList(T, s, e, x, y) List T[i] s.t. x ≦ T[i] < y in T[s, e) O(log σ + occ) NextValue(T, s, e, x) smallest T[r] > x, s.t. s ≦ r ≦ e O(log σ)
  • 29. Wavelet tree improvements  Huffman Tree shaped Wavelet Tree (HWT)  If the alphabet tree corresponds to the Huffman tree for T, the size of the wavelet tree becomes nH0 + o(nH0) bits.  Expected query time is also O(H0) when the query distribution is same as in T  Wavelet Tree for large alphabets  For large alphabet set (e.g. σ ≒, n) the overhead of alphabet tree, and bit vectors could large.  Concatenate bit vectors of same depth. In traversal of tree, we iteratively compute the range of the original bit array.
  • 30. 2D Search using Wavelet Tree  Given a 2D data set D={(xi, yi)}, find all points in [sx, sy] x [ex, ey]  By using a wavelet tree, wen can solve this in O(log ymax + occ) time Remove columns with no data. If there are more than one data in the same column, we copy the column so that each column has exactly one data. 0 1 2 3 4 5 6 0 1 2 2 4 4 6 6 6 0 * 0 * Store the x information using a bit array 1 * * 1 * * Xb = 101011001100111 2 2 3 * * * 3 * * * Since each column has 4 * 4 * one value, we can convert it to the string 5 * 5 * T =313613046 6 * 6 *
  • 31. 2D Search using Wavelet Tree (contd.)  [sx, sy] x [ex, ey] region is found by  s’ := Select0(X, sx)+1 , e’ := Select0(X, ex)+1  s’’ := Rank1(X, s’) e’’ := Rank1(X, e’)  Perform RangeFreq(s’’, e’’, sy, ey) 0 1 2 3 4 5 6 0 * 1 * * X = 101011001100111 2 3 * * * T =313613046 4 * 5 * 6 *
  • 33. Full-text Index  Given a text T[0, n) and a query Q[0, m), support following operations.  occ(T, Q) : Report the number of occurrence of Q in T  locate(T, Q) :Report the positions of occurrence of Q in T  Example i 0123456789A T abracadabra occ(T, “abra”)=2 locate(T, “abra”)={0, 7}
  • 34. Suffix Array  Let Si = T[i, n] be the i-th suffix of T  Sort {Si}i=0…n by its lexicographical order  Store sorted index in SA[0…n] in n lg n bits (SSA[i] < SSA[i+1]) S0 abracadabra$ S11 $ 11 S1 bracadabra$ S10 a$ 10 S2 racadabra$ S7 abra$ 7 S3 acadabra$ S0 abracadabra$ 0 S4 cadabra$ S3 acadabra$ 3 S5 adabra$ S5 adabra$ 5 S6 dabra$ S8 bra$ 8 S7 abra$ S1 bracadabra$ 1 S8 bra$ S4 cadabra$ 4 S9 ra$ S6 dabra$ 6 S10 a$ S9 ra$ 9 S11 $ S2 racadabra$ 2
  • 35. Search using SA  Given a query Q[0…m), find the range [sp, ep) s.t. sp := max{i | Q ≦ SSA[i]} Q=“ab”, sp=2, ep=4 ep := min{i | Q > SSA[i]} S11 $ S10 a$  Since {SSA[i]} are sorted, sp and ep S7 abra$ can be found by binary search in O(m log n) time S0 abracadabra$  Each string comparison require O(m) time. S3 acadabra$  The occurrence positions are SA[sp, ep) S5 adabra$ S8 bra$  if ep=sp then Q is not found on T S1 bracadabra$ S4 cadabra$  Occ(T, Q) : O(m log n) time. S6 dabra$  Locate(T, Q) : O(m log n + Occ) time S9 ra$ S2 racadabra$
  • 36. Suffix Tree 11 10 $  A trie for all suffixes of T $ bra 7  Remove all nodes with only one child. $ c c 0  Store suffix indexes (SA) at leaves  Store depth information at internal nodes, and a 3 d edge information is recovered from depth and SA . 5  Fact: # of internal nodes is at most N-1 bra $ c 8  Proof: Consider inserting a suffix one by one. At each insertion, at most one internal node will c 1 be build. d 4 ra 6  Many string operations can be supported by using $ 9 a suffix tree. c 2
  • 37. Burrows Wheeler Transform (BWT) i SA BWT Suffix  Convert T into BWT as follows 0 11 a $ BWT[i] := T[SA[i] – 1] 1 10 r a$  if SA[i] = 0, BWT[i] = T[n] 2 7 d abra$  abracdabra$->ard$rcaaaabb 3 0 $ abracadabra$ 4 3 r acadabra$ 5 5 c adabra$ 6 8 a bra$ 7 1 a bracadabra$ 8 4 a cadabra$ 9 6 a dabra$ 10 9 b ra$ 11 2 b racadabra$
  • 38. Characteristics of BWT  Reversible transformation  We can recover T from only BWT  Easily compressed  Since similar suffixes tend to have preceding characters, same characters tend to gather in BWT. bzip2  Implicit suffix information  BWT has same information as suffix arrays, and can be used for full-text index (so called FM-index).
  • 39. Example : Before BWT When Farmer Oak smiled, the corners of his mouth spread till they were within an unimportant distance of his ears, his eyes were reduced to chinks, and diver gingwrinkles appeared round them, extending upon his countenance like the rays in a rudimentary sketch of the rising sun. His Christian name was Gabriel, and on working days he was a young man of sound judgment, easy motions, proper dress, and general good character. On Sundays he was a man of misty views, rather given to postponing, and hampered by his best clothes and umbrella : upon the whole, one who felt himself to occupy morally that vast middle space of Laodiceanneutrality which lay between the Communion people of the parish and the drunken section, -- that is, he went to church, but yawned privately by the time the con+gegation reached the Nicene …
  • 40. Example : After BWT ooooooioororooorooooooooorooorromrrooomooroooooooormoorooororioooro ormmmmmuuiiiiiIiuuuuuuuiiiUiiiiiioooooooooooorooooiiiioooioiiiiiiiiiiioiiiiiieuiiiiiiii iiiiiiiiiouuuuouuUUuuuuuuooouuiooriiiriirriiiiriiiiiiaiiiiioooooooooooooiiiouioiiiioiiu iiuiiiiiiiiiiiiiiiiiiiiiiioiiiiioiuiiiiiiiiiiiiioiiiiiiiiiiiiioiiiiiiuiiiioiiiiiiiiiiiioiiiiiiiiiioiiiioiiiiiiioiiiaiiiiiiiiiiiiiiiii oiiiiiioiiiiiiiiiiiiiiiuiiiiiiiiiiiiiiiiiioiiiiiiiioiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuuuiioiiiiiuiiiiiiiiiiiiiiiiiiiiiiiioiiiiuioi uiiiiiiioiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioaoiiiiioioiiiiiiiioooiiiiiooioiiioiiiiiouiiiiiiiiiiiiooiiiiiiiiiiiiiiiiiiiii iiiiiiiiiioiiiiiiiiiiiiiiiiiiioiooiiiiiiiiiiioiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiiiiiiiiiiiioiiiuiiiiiiiiiioiiiiiiiiiiiiuoiii oiiioiiiiiiiiiiiiiiiiiiiiiiuiiiiuuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuiuiiiiiuuiiiiiiiiiiiiiiiiiiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiioiii iiiioiiiiiiiiiiiiiiiiiiiiioiiiiiiiiioiiiiuiiiioiiiioiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiioiiiiiiuiiiiiiiiiiiiiiiooiiiiiiiiiiiiiiiiii iioooiiiiiiiioiiiiouiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiioiiiiiiioiiiiiioiiiiuiuoiiiiiiiiioiiiiiiioiiiiiiiiiiii iiiiiiiiiiiiiiiioioi….
  • 41. LF-mapping i BWT Suffix (First char is F)  Let F[i] = T[SA[i]] 0 a1 $  First characters of suffixes 1 r a1$  Fact : the order of same chars 2 d a2bra$ in BWT and F is same 3 $ a3bracadabra$ 4 r a4cadabra$ 5 c a5dabra$ 6 a2 bra$ 7 a3 bracadabra$ 8 a4 cadabra$ 9 a5 dabra$ 10 b ra$ 11 b racadabra$
  • 42. LF-mapping (1/3) i BWT Suffix (First char is F)  Let F[i] = T[SA[i]] 0 a1  First characters of suffixes 1 r a1  Fact : the order of same chars 2 d a2 in BWT and F is same 3 $ a3bracadabra$  Proof: the order of chars in BWT 4 r a4 is determined by the following 5 c a5 Same suffixes. 6 a2 the order of chars in F is also 7 a3 bracadabra$ determined by the following 8 a4 suffixes. These are same ! 9 a5 10 11
  • 43. LF-mapping (2/3) i BWT Suffix (First char is F)  Let F[i] = T[SA[i]] 0 a $  First characters of sorted suffixes 1 r a$ 2 d abra$  Let C[c] be the number of 3 $ abracadabra$ characters smaller than c in T  C[c] = beginning position of F[i] = c 4 r acadabra$  For B[i]=c, its corresponding char 5 c adabra$ in F[i] is found by Rankc(T, i) + C[c] 6 a bra$ 7 a bracadabra$ 8 a cadabra$ 9 a dabra$ 10 b ra$ 11 b racadabra$
  • 44. LF-mapping (3/3)  Let ISA[0, n) s.t. ISA[r]=p if SA[p] = r  SA[i] converts the rank into the position, and ISA[i] vice versa.  For c := BWT[r], we define two functions LF(r) := Rankc(T, r) + C[c] = ISA[SA[r] -1] PSI(r) := Selectc(T, r– C[c]) = ISA[SA[r] + 1]  LF and PSI are inverse functions each other.  Using LF and PSI, we can move a text in its position order.  PSI = forward by one char , LF = backward by one char  e.g. T = abracadabra SA[LF[r]] SA[r] SA[PSI[r]]
  • 45. Decoding of BWT using LF-mapping r := beg // the rank of T[0, n), i.e. ISA[0] while Output BWT[r] r := PSI(r) until r != beg That’s all ! In practice, PSI values are pre-computed using O(n) time
  • 47. Compressed Full-Text Index  Full-text can be compressed  Moreover, we can recover original text from compressed index (self-index)  I explain FM-index only here.  Other important compressed full-text indices  Compressed Suffix Arrays  LZ-index  LZ77-based
  • 48. FM-index P’=ra appear in P P=a appeared in [1, 5) i BWT Suffix  Using LF-mapping for searching 0 a $ pattern. 1 r a$ 2 d abra$  Assume that a range [sp, ep) 3 $ abracadabra$ correspond to a pattern P. 4 r acadabra$  Then if a pattern P’ = cP occurs, 5 c adabra$ then [sp, ep) should be preceded by c 6 a bra$ 7 a bracadabra$ 8 a cadabra$ 9 a dabra$ 10 b ra$ 11 b racadabra$
  • 49. FM-index P’=ra appear in P P=a appeared in [1, 5) i BWT Suffix  Using LF-mapping for searching 0 a $ pattern. 1 r1 a$ 2 d abra$  Assume that a range [sp, ep) 3 $ abracadabra$ correspond to a pattern P. 4 r2 acadabra$  Then if a pattern P’ = cP occurs, 5 c adabra$ then [sp, ep) is preceded by c 6 a bra$ 7 a bracadabra$  The corresponding suffixes can be 8 a cadabra$ found by Rankc(BWT, sp) + C[c] 9 a dabra$ and Rankc(BWT, ep) + C[c] 10 b r1a$ 11 b r2acadabra$
  • 50. FM-index Input query : Q[0, m) Output a range [sp, ep) corresponding to Q sp := 0, ep := n for i = m-1 to 0 c := Q[c] sp := C[c] + Rankc(BWT, sp) ep := C[c] + Rankc(BWT, ep) end for Return (sp, ep) Occ(T, Q) = ep - sp Locate(T, Q) = SA[sp, ep)
  • 51. Storing Sampled SA’s i SA R  Store sampled SAs instead of original SA. 0 0  Sample SA[p] s.t. SA[p] = tk for (t=0…) 1 0  We also store a bit array R to indicate the index of 2 0 sampled SAs. 3 0 1 4 0  Recall that LF(r) = ISA[SA[r]-1] for c = BWT[c]. 5 0  If SA[p] is not sampled, we use LF to move on SA 6 8 1  SA[r] = SA[LF(r)]-1 = SA[LF2(r)]-2 = … = SA[LFk(r)]-k 7 0  We move on SA until the sampled point is found 8 4 1 using at most k LF operations. 9 0 10 0  Set k = log1+εn. The size of A is (n/k) log n = o(n) 11 0  The lookup of SA[i] requires O(log1+εn logσ) time.
  • 52. Grammar Compression on SA [R. Gonzalez+ CPM07]  SA can also be compressed using grammar based compressions  Let DSA[i] := SA[i] – SA[i-1] (for i > 0)  If BWT[i] = BWT[i+1] = … = BWT[i+run] then DSA[i+1, i+run] = DSA[k+1, k+run] for k := LF[i]  DSA contains many repetition if BWT has many runs.  DSA can be compressed using grammar compression  [R. Gonzelez+ CPM07] used Re-pair but other grammar compressions can be used as long as they support fast decompression.
  • 53. Compression Boosting in FM-index  The size of FM-index using wavelet trees is nH0 + o(nH0) bits  We can improve the space to nHK by compression boosting  Compression Boosting [P. Ferragina+ ACM T 07]  Divide BWT into blocks using contexts of length k  Build a separate wavelet tree for each block  Implicit Compression Boosting [Makinen+ SPIRE 07]  Wavelet Tree + compressed bit vectors using RRR in Wavelet Tree The size is close to optimal: the overhead is at most σk+1 log n, which is o(n) when k ≦ ((1-ε)logσn)-1 b aaaa.. b aabb.. Divide BWT’s a aacc.. using following b aacd.. suffix of length d abcd.. d abcc..
  • 54. Compression Boosting using fixed length block [Karkkaine+ SPIRE 11]  Dividing BWT into blocks of fixed length can also achieve nHK  Block size is σ(log n)2  To analyze the working space, we calculate the overhead incurred of conversion from context-sensitive partition into fixed-length partition  Note. This encoding is very similar to the encoding used in bzip2
  • 55. SSA: Huffman WT AFFMI : Context Sensitive Partition of BWT Fixed Block : Fixed Block + RRR : Represent bit vectors in WT using RRR encoding
  • 56. Summary of state-of-the-art  Give an input String T, compute BWT for T (=B).  Divide B into blocks of fixed length  Each block is stored in a Huffman-shaped wavelet tree.  A bit-vector in a wavelet tree is stored using RRR representation.  For English, 2-3 bits per character, 1 pattern 0.06 msec.
  • 58. Document Query  DL(D, P) : Document Listing  List the distinct documents where P appears  DLF(D, P) :Document Listing with Frequency  The result of DL(D, P) + frequency of P in each document  Top(k, D, P) : Top-k Retrieval  List the k documents where P appears with largest scores S(D, P)  e.g. S(D, P) : term frequency, document’s score (e.g. Page rank)  In the previous section we consider hit positions, and here we consider hit documents  We need to convert hit positions into hit documents.  This is not easy.
  • 59. Index for document collection  For a document collection D={d1, d2, …, dD}  Build T = d1#d2#d3#...#dD and build an full-text index for T  We also use a bit array to indicate the beginning positions of docs.  Given a hit position in T, we can convert it to the hit doc in O(1) time.  Baseline of DL and DLF  Perform Locate(T, Q) and convert each hit positions into a hit doc each independently.  However, this requires O(occ(T, Q)) time, and occ(T, Q) could be much larger than # of hit documents
  • 60. DocArray and PrevArray (1/2) [Muthukrishnan SODA 02]  DocArray : D[0, n] s.t. D[i] is the doc ID of SA[i].  DL(D, Q) is now to list all distinct values in the hit range D[sp, ep)  PrevArray : P[0, n] s.t. P[i]=max{j < I, D[j]=D[i]}  P[i] can be simulated as selectD[i] (D, rankD[i](D, i-1)) i 0 1 2 3 4 5 6 D 66 77 66 88 77 77 88 P -1 -1 0 -1 1 4 3
  • 61. DocArray and PrevArray (2/2)  Alg. DocList reports all distinct doc IDs in [sp, ep) in O(|DL|) time.  Call DocList(sp, sp, ep) DocList(gsp, sp, ep) i 0 1 2 3 4 5 6 IF sp=ep D 66 77 66 88 77 88 88 RETURN; (min, pos) := RMQ(P, sp, ep) P -1 -1 0 -1 1 3 5 IF min ≧ gsp RMQ(P, 2, 7) = -1 (=P[3]) RETURN; Report D[3] Report D[pos] DocList(gsp, sp, pos) DocList(gsp, pos+1, ep) RMQ(P, 2, 3) = 0 (=P[2]) RMQ(P, 5, 7) = 3 (=P[5]) Report D[2] Return
  • 62. Wavelet Tree + Grammar Compression [Navarro+ SEA 11]  Build a wavelet tree for D.  Using Freqlist(k, sp, ep), we can solve DL and DLF in O(log σ) time.  However storing D requires n lg |D| bits. This is very large.  As in compression of suffix arrays using grammar compression, D can also be compressed by grammar compression  Apply a grammar compression to bit arrays in wavelet trees, and support rank/select operations  Currently, this is the fastest/smallest approach in practice.
  • 63. Suffix Tree based approach (1/4) [W.K. Hon+ FOCS 09]  Let ST be the generalized suffix tree for a collection of documents  A leaf l is marked with document d if l corresponds to d  An internal node l is marked with d if at least two children of v contain leaves marked with d  An internal node can be marked with many values d. 1 c 2 a b c 3 7 a 10 c 4 5 6 8 9 11 12 13 14 a c b a a c c b c
  • 64. Suffix Tree based approach (2/4)  In every node v marked with d, we store a pointer ptr(v, d) to its lowest ancestor u such that u is also marked with d  ptr(v, d) also has a weight corresponds to relevance score.  If such ancestor does not exist, ptr(v, d) points to a super root node. 1 c ptr(7, a) = 2 2 a b c ptr(13, a) = 2 3 7 a 10 c 4 5 6 8 9 11 12 13 14 a c b a a c c b c
  • 65. Suffix Tree Based Approach (3/4)  Lemma 1. The total number of pointers ptr(*,*) in T is at most 2n  Lemma 2. Assume that document d contains a pattern P and v is the node corresponding to P. Then there exist a unique pointer ptr(u, d), such that u is in the subtree of v and ptr(u, d) points to an ancestor of v 1 c Example: when v = 10. ptr(10, c)=2, 2 a b c and ptr(13, b)=2. These ptrs correspond to c the occurrence of P in b and c. 10 11 12 13 c c b
  • 66. Suffix Tree Based Approach (4/4)  To enumerate document corresponding to the node x, find all ptrs that begins in the subtree of x, and points to the ancestor of x.  These ptrs uniquely corresponds to each hit documents (Lemma) x
  • 67. Geometric Interpretation pd pd(v, c) := -1 c a b depth of ptr(v, c) 0 c c 1 a c b a c b place all ptrs in 2 a a c c a point (x, pd(v,c)) 1 2 2 2 4 5 6 7 8 9 10 11 12 13 14 depth 1 c 0 2 a b c 1 3 7 a 10 c 2 4 5 6 8 9 11 12 13 14 3 a c b a a c c b c
  • 68. Document Query = three-sided 2D query  When v := locus(Q) d := depth(v), and r be the maximal index assigned to v or its descendants. DL can be solve by RangeList(v, r, 0, d-1) pd -1 v=2 c a b v=10 0 c c 1 a c b a c b 2 a a c c 1 2 2 2 4 5 6 7 8 9 10 11 12 13 14
  • 69. Top-K query  Each ptr(v, d) can store any score function depended on Q, D.  For each node v, f ptr(x, d) points to vTop-K  Theorem [Navarro+ SODA 12] k most highly weighted points in the range [a, b] x [0, h] can be reported in decreasing order of their weights in O(h+k) time  In practice, we can use a priority queue on each node.
  • 70. Bibliography [Navarro+ SEA 12] Practical Compressed Document Retrieval, G. Navarro, S. J. Puglisi, and D. Valenzuela, SEA 2012 [Navarro+ SODA 12] Top-k Document Retrieval in Optimal Time and Linear Space, SODA 2012 [T. Gagie+ TCS+ 11] New Algorithms on Wavelet Trees and Applications to Information Retrieval, TCS 2011 [G. Navarro+ 11] Practical Top-K Document Retrieval in Reduced Space [Karkkaine+ SPIRE 11] Fixed Block Compression Boosting in FM-indexes, J. Karkkainen, S. J. Puglisi, SPIRE 2011 [W.K. Hon+ FOCS 09] Space-Efficient Framework for Top-k String Retrieval Problems, W. K. Hon, R. Shah, J. S. Vitter, FOCS 2009
  • 71. [Navarro+ SEA 11] “Practical Compressed Document Retrieval”, Gonzalo Navarro , Simon J. Puglisi, and Daniel Valenzuela,, SEA 2011 [Navarro+ SODA 12] Top-k Document Retrieval in Optimal Time and Linear Space , Gonzalo Navarro, Yakov Nekrich, SODA 2012 [Gagie+ JTCS 11] New Algorithms on Wavelet Trees and Applications to Information Retrieval, Travis Gagie , Gonzalo Navarro , Simon J. Puglisi, J. of TCS 2011