Succinct Data Structure for Analyzing Document Collection

2011/12/1 ALSIP2011@Takamatsu

Succinct Data Structure for
Analyzing Document Collection

Preferred Infrastructure Inc.
Daisuke Okanohara
hillbig@preferred.jp

Agenda

 Succinct Data Structures (20 min.)
Succinct data Full-text
 Rank/Select, Wavelet Tree
Structures Index
 Full-text Index (30 min.)
 Suffix Array/Tree, BWT Compressed Full-
text Index
 Compressed Full-Text Index (20 min.)
 FM-index Analyzing Document
 Analyzing Document Collection (20 min.) Collection
 Document Statistics (for mining)
 Top-k query

Background

 Very large document collections appear everywhere
 Twitter : 200 billion tweets per year. (6K tweets per sec.)
 Modern DNA Sequencing
 E.g. Illumina HiSeq 2000 produces 40G bases per day

 Analyzing these large document collections require new data structures
and algorithms

Notation

 T[0, n] : string T[0]T[1]…T[n]
 T[i, j] : subarray from T[i] to T[j] inclusively
 T[i, j) : subarray from T[i] to T[j-1]

 lg m : log2 m (# of bits to represent m cases)

 Σ : the alphabet set of string
 σ := |Σ| : the number of alphabets
 n : length of input string
 m : length of query string

Preliminaries
Empirical Entropy (1/2)
 Empirical entropy indicates the compressibility of the string
 Zero-order empirical entropy of T
H0(T) := ∑c n(c)/n log (n/p(c))
 n(c) : the number of occurrences of c in T
 n : the length of T
 The lower-bound for the average code length of compressor not using
context information

T = abccbaabcab,
n(a)=4,
n(b)=4,
n(c)=3, H0(T) = 1.57 bit

Empirical Entropy (2/2)

 k-th order empirical entropy of T
 Hk(T) = ∑w∈Σk-1 nw/n H0(Tw) :
 Tw : the characters followed by a substring w
 nw : the number of occurrences of a string w in T.
 The lower-bound for the average code length of compressor using the
following substring of length k-1 as a context information

 T = abccbaabcab
Ta=bac Tb=acaa Tc=bcb, H2(T) = 0.33 bit

 Hierarchy of entropy: log σ ≧ H0(T) ≧ H1(T) ≧ … Hk(T)
 Typical English Text, lg σ=8 H0=4.5 , H5=1.7

Rank/Select Dictionary

For a bit vector B[0, n), B[i] ∈{0, 1}, consider following operations
 Lookupc(B, i) : B[i]
 Rankc(B, i) : The number of c’s in B[0…i)
 Selectc(B, i) : The (i+1)-th occurrence of c in B

Example:
i 0123456789
B 0111011001

Rank1(B,6)=4 Select0(B,2)=2
The number of 1’s in B[0, 6) The (2+1)-th occurrence of 0 in B

Rank/Select Dictionary Usage (1/2)
Subset
 For U = {0, 1, 2, …, n-1}, a subset of U, V⊆U can be represented by a bit
vector B[0, n-1], s.t. B[i] = 1 if i∈V, and B[i]=0 otherwise.
 Select(B, i) is the (i+1)-th smallest element in V,
 Rank(B, i) = |{e|e∈V, e<i}|

V = {0, 4, 7, 10, 15}

i 0123456789012345
B = 1000100100100001

Rank/Select Dictionary Usage (2/2)
Partial Sums
 For a non-negative integer array P[0, m), we concatenate each unary
codes of P[i] to build a bit array B[0, n)
 An unary code of a non-negative integer p≧0 is 0p1
 E.g. unary codes of 0, 3, 10 are 1, 0001, 00000000001
 From the definition, n=m+∑p[i] and the number of ones is m
 P[i] = Select1(B, i) - Select1(B, i-1).
 Cum[k] = ∑i=0…kP[i]. = Select1(B, k) - k (# of 0’s before (k+1)-th 1)
 Find r s.t. Cum[r] ≦ x < Cum[r+1] by Rank1(B, Select0(B, x))

P = {3, 5, 2, 0, 4, 7}
B = 000100000100110000100000001
Cum[4] = Select(B, 4) – 4 = 14

Data structures for Rank/Select

 Bit vector B[0, n) with m ones can be represented in lg (n, m) bits
 lg (n, m) ≒ m lg2e + m lg (n/m) when m << n
 The implementations of Rank/Select Dictionary are divided into the case
when the bit vector is dense, sparse, and very sparse.

 Case 1. Dense 01101011010100110101100101
 Case 2. Sparse (RRR) 00001001000010001000100010
 Case 3. Very Sparse (SDArray) 00000000000010000000001000

Case 1. Dense
 Divide B into large blocks of length l=(log n)2
 Divide each large block into small blocks of length s=(log n)/2
 For each large block, store L[i]=Rank1(B, il)
 For each small block, store S[i]=Rank1(B, is) - Rank1(B, is / l)
 Build a table P(X) that stores the number of ones in X of length s

 Rank1(B, i) = L[i/l] + S[i/s] + P(B[i/s*s, i))
 three table lookup. O(1) time L[1]
 Total size of L, S, P is o(n)
+S[4]
 Select can also be supported in O(1) time
using n+o(n) bits +P[X]

In practice, P(X) is often replaced with popcount(X)

Case 2. Sparse (RRR)
 Case 1. is redundant when m << n
 A small block X is represented by the class and the offset.
 The class C(X) is the number of ones in X
 e.g . X=101, C(X) = 2.
 The offset O(X) is the lexicographical rank among C(X)
 e.g. X0=011, X1=101, X2 =110, O(X0)=0, O(X1)=1, O(X2)=2
 Store a pointer of every (log n)2/2 blocks and an offset for each element

B = 111110100101010000100111010
C = 3 2 1 2 1 0 1 3 1
O = 0 2 2 1 1 0 2 2 1

Rank/Select Implementation
Case 2. Sparse (RRR)
 Ida: when m << n, many classes (# of ones in small block) are small, and
the offsets are also small

 Total size is nH0(B) + o(n) bits. Rank is supported in O(1) time
 We can also support select in O(1) time using the same working space.

 Note: RRR is also efficient when ones and zeros are clustered.
 The following bit is not sparse, but can be efficiently stored by RRR encoding.
000000100000100011111111110111110111111111100000000010000001

many 0s many 1s many 0s

Case 3. Very Sparse (SD array)
 Case 2. is not efficient for very sparse case.
 o(n) term cannot be negligible when nH0 is very small
 Let S[i] := select1(B, i) and t = lg (n/m).
 L[i] := the lowest t bits of S[i]
 P[i] := S[i] / 2t
 Since P is weakly increasing integers, use prefix sums to represent H
 Concatenate unary codes of P[i]-P[i-1].

B=0001011000000001
S[0, 3] = [3, 5, 6, 15], t = lg(16 / 4) = 2.
L[0, 3] = [3, 1, 2, 3]
P[i] = [0, 1, 1, 3], H = 1011001

Case 3. Very Sparse (SDArray)
 L is stored explicitly, and H is stored with Rank/Select dictionary.
 Size
H : 2m bits (H[m] ≦ m)
L : m log2 (n/m) bits.
 select(B, i) = (select1(H, i)-i)2t + L[i]
 O(1) time
 rank(B, i) = linear search from 1 + rank1(H, select0(H, i))
 expected O(1) time.

Rank/Select for large alphabets

 Consider Rank/Select on a string with many alphabets
 I will explain only wavelet trees here since it is very very useful !

 I will not discuss other topics of rank / select for large alphabets
 Multi-array wavelet trees [Ferragina+ ACM Transaction 07]
 Permutation based approach [Golynski SODA 06]
 Alphabet partitioning [Barbey+ ISSAC 10]

Wavelet Tree

 Wavelet Tree = alphabet tree + bit vectors in each node

T = ahcbedghcfaehcgd

a b c d e f g h

Wavelet Tree


ahcbedghcfaehcgd Filter characters according to its value
010010110
For each character c := T[i] (i = 0…n),
we move c to the left child and store 0
acbdc hegh if c is the descendant of the left child.
We move c to the right child and store 1
otherwise.

a b c d e f g h

Wavelet Tree

 Recursively continue this process.
 Store a alphabet tree and bit vectors of each node

ahcbedghcfaehcgd
0100101101011010
0 1
acbdcacd heghfehg
01011011 10110011

aba cdccd efe hghhg
010 01001 010 10110

a b c d e f g h

Wavelet Tree (contd.)

 Total bit length is n lg σ bits
 The height is l2σ, and there are n bits at each depth.

0100101101011010

01011011 10110011 Height is lg σ

010 01001 010 10110

a b c d e f g n

Wavelet Tree (Lookup)

 Lookup(T, pos)
pos := 9
 e.g. Lookup(T, 9)

0100101101011010 b := B[pos] = 0
pos := Rankb(B, pos) = 4
0 1
b := B[pos] = 0
01011011 10110011
pos := Rankb(B, pos) = 2

010 01001 010 10110 b := B[pos] = 0

a b c d e f g n Return “c”

Wavelet Tree (Rank)

 Rankc (T, pos) ahcbedghcfaehcgd
 e.g. Rankc(T, 9)

0100101101011010 b := B[9] = 0
pos := Rank0(B, 0) = 4
0 1
b := B[4] = 4
01011011 10110011
pos := Rank1(B, 4) = 2

b := B[2] = 0
010 01001 010 10110 pos := Rank0(B, 2) = 1

a b c d e f g n Return 1

Wavelet Tree (Select)

 Selectc (T, i)
 Example Selectc(T, 1)

0100101101011010 i := Select0(B,4) = 8
0 1

01011011 10110011 i := Select1(B,2) = 4

i := Select0(B,1) = 2
010 01001 010 10110

a b c d e f g n i=1

Wavelet Tree (Quantile List) Wavelet Tree is so called
double-sorted array

 Quantile(T, p, sp, ep) : Return the (p+1)-th largest value in T[sp, ep)
E.g. Quantile(T, 5, 4, 12)

0100101101011010 # of zeros in B[4,12] is 3
3 < (5+1) and go to right
0 1
01011011 # of zeros in B[1,6] is 3
10110011
3 > 2 and go to left

# of zeros in B[0, 2] is 2
010 01001 010 10110 2 = 2 and go to right

a b c d e f g n Return “f”
43782614
ahcbedghcfaehcgd

Wavelet Tree (top-k Frequent List )

 FreqList (T, k, sp, ep) : Return k most frequent values in T[sp, ep)
E.g. FreqList(T, 5, 4, 12)
Greedy Search on the tree.
0100101101011010
Larger ranges will be examined first.
0 1

01011011 10110011 Q := priority queue {# of elems, node}
Q.push {[ep-sp, root]}
while
q = Q.pop()
010 01001 010 10110 If q is leaf, report q and return
Q.push([0’s in range, q.left])
Q.push([1’s in range, q.right])
a b c d e f g n end while

Wavelet Tree (RangeFreq, RangeList)

 RangeList (T, sp, ep, min, max) : Return {T[i]} s.t. min≦ [i]<max, sp≦i<ep
 RangeFreq(T, sp, ep, min, max) : Return |RangeList(T, sp, ep, min, max)|
E.g. RangeList(T, 4, 12, c, e)

0123457890123456 Enumerate nodes that go to between
0100101101011010 min and max
0 1
For RangeList, # of enumerated
01011011 10110011 nodes is O(log σ + occ)

For RangeFreq, # of enumerated n
odes is O(log σ)
010 01001 010 10110

a b c d e f g n

Wavelet Tree Summary

Input T[0…n), 0≦T[i]<σ
Many operations can be supported in O(log σ) time using wavelet tree

Description Time
Lookup T[i] O(log σ)
Rankc(T, p) the number of c in T[0…p) O(log σ)
Selectc(T, i) the pos of (i+1)-th c in T O(log σ)
Quantile(T, s, e, r) (r+1)-th largest val in T[s, e) O(log σ)
FreqList(T, k, s, e) k most frequent values in T[s, e) Expected O(klog σ)
Worst O(σ)
RangeFreq(T, s, e, x, y) items x ≦ T[i] < y in T[s, e) O(log σ)

RangeList(T, s, e, x, y) List T[i] s.t. x ≦ T[i] < y in T[s, e) O(log σ + occ)
NextValue(T, s, e, x) smallest T[r] > x, s.t. s ≦ r ≦ e O(log σ)

Wavelet tree improvements

 Huffman Tree shaped Wavelet Tree (HWT)
 If the alphabet tree corresponds to the Huffman tree for T, the size of the
wavelet tree becomes nH0 + o(nH0) bits.
 Expected query time is also O(H0) when the query distribution is same as
in T

 Wavelet Tree for large alphabets
 For large alphabet set (e.g. σ ≒, n) the overhead of alphabet tree, and bit
vectors could large.
 Concatenate bit vectors of same depth. In traversal of tree, we iteratively
compute the range of the original bit array.

2D Search using Wavelet Tree

 Given a 2D data set D={(xi, yi)}, find all points in [sx, sy] x [ex, ey]
 By using a wavelet tree, wen can solve this in O(log ymax + occ) time

Remove columns with no data.
If there are more than one data in the same column, we copy the column so that
each column has exactly one data.

0 1 2 3 4 5 6 0 1 2 2 4 4 6 6 6
0 * 0 * Store the x information
using a bit array
1 * * 1 * *
Xb = 101011001100111
2 2
3 * * * 3 * * * Since each column has
4 * 4 * one value, we can convert
it to the string
5 * 5 * T =313613046
6 * 6 *

2D Search using Wavelet Tree (contd.)

 [sx, sy] x [ex, ey] region is found by
 s’ := Select0(X, sx)+1 , e’ := Select0(X, ex)+1
 s’’ := Rank1(X, s’) e’’ := Rank1(X, e’)
 Perform RangeFreq(s’’, e’’, sy, ey)

0 1 2 3 4 5 6
0 *
1 * *
X = 101011001100111
2
3 * * * T =313613046
4 *
5 *
6 *

Full-text Index

 Given a text T[0, n) and a query Q[0, m), support following operations.
 occ(T, Q) : Report the number of occurrence of Q in T
 locate(T, Q) :Report the positions of occurrence of Q in T

 Example
i 0123456789A
T abracadabra

occ(T, “abra”)=2
locate(T, “abra”)={0, 7}

Suffix Array

 Let Si = T[i, n] be the i-th suffix of T
 Sort {Si}i=0…n by its lexicographical order
 Store sorted index in SA[0…n] in n lg n bits (SSA[i] < SSA[i+1])

S0 abracadabra$ S11 $ 11
S1 bracadabra$ S10 a$ 10
S2 racadabra$ S7 abra$ 7
S3 acadabra$ S0 abracadabra$ 0
S4 cadabra$ S3 acadabra$ 3
S5 adabra$ S5 adabra$ 5
S6 dabra$ S8 bra$ 8
S7 abra$ S1 bracadabra$ 1
S8 bra$ S4 cadabra$ 4
S9 ra$ S6 dabra$ 6
S10 a$ S9 ra$ 9
S11 $ S2 racadabra$ 2

Search using SA

 Given a query Q[0…m), find the range [sp, ep) s.t.
sp := max{i | Q ≦ SSA[i]} Q=“ab”, sp=2, ep=4
ep := min{i | Q > SSA[i]} S11 $
S10 a$
 Since {SSA[i]} are sorted, sp and ep
S7 abra$
can be found by binary search in O(m log n) time S0 abracadabra$
 Each string comparison require O(m) time. S3 acadabra$
 The occurrence positions are SA[sp, ep) S5 adabra$
S8 bra$
 if ep=sp then Q is not found on T
S1 bracadabra$
S4 cadabra$
 Occ(T, Q) : O(m log n) time. S6 dabra$
 Locate(T, Q) : O(m log n + Occ) time S9 ra$
S2 racadabra$

Suffix Tree
11
10 $
 A trie for all suffixes of T $ bra 7
 Remove all nodes with only one child. $
c c 0
 Store suffix indexes (SA) at leaves
 Store depth information at internal nodes, and a 3
d
edge information is recovered from depth and SA . 5
 Fact: # of internal nodes is at most N-1 bra
$
c 8
 Proof: Consider inserting a suffix one by one.
At each insertion, at most one internal node will c 1
be build. d 4

ra 6
 Many string operations can be supported by using $
9
a suffix tree.
c 2

Burrows Wheeler Transform (BWT)

i SA BWT Suffix
 Convert T into BWT as follows
0 11 a $
BWT[i] := T[SA[i] – 1]
1 10 r a$
 if SA[i] = 0, BWT[i] = T[n]
2 7 d abra$

 abracdabra$->ard$rcaaaabb 3 0 $ abracadabra$
4 3 r acadabra$
5 5 c adabra$
6 8 a bra$
7 1 a bracadabra$
8 4 a cadabra$
9 6 a dabra$
10 9 b ra$
11 2 b racadabra$

Characteristics of BWT

 Reversible transformation
 We can recover T from only BWT
 Easily compressed
 Since similar suffixes tend to have preceding characters, same characters
tend to gather in BWT. bzip2
 Implicit suffix information
 BWT has same information as suffix arrays, and can be used for full-text
index (so called FM-index).

Example : Before BWT

When Farmer Oak smiled, the corners of his mouth spread till they were
within an unimportant distance of his ears, his eyes were reduced to chinks,
and diver gingwrinkles appeared round them, extending upon his
countenance like the rays in a rudimentary sketch of the rising sun. His
Christian name was Gabriel, and on working days he was a young man of
sound judgment, easy motions, proper dress, and general good character. On
Sundays he was a man of misty views, rather given to postponing, and
hampered by his best clothes and umbrella : upon the whole, one who felt
himself to occupy morally that vast middle space of Laodiceanneutrality which
lay between the Communion people of the parish and the drunken section, --
that is, he went to church, but yawned privately by the time the con+gegation
reached the Nicene …

Example : After BWT

ooooooioororooorooooooooorooorromrrooomooroooooooormoorooororioooro
ormmmmmuuiiiiiIiuuuuuuuiiiUiiiiiioooooooooooorooooiiiioooioiiiiiiiiiiioiiiiiieuiiiiiiii
iiiiiiiiiouuuuouuUUuuuuuuooouuiooriiiriirriiiiriiiiiiaiiiiioooooooooooooiiiouioiiiioiiu
iiuiiiiiiiiiiiiiiiiiiiiiiioiiiiioiuiiiiiiiiiiiiioiiiiiiiiiiiiioiiiiiiuiiiioiiiiiiiiiiiioiiiiiiiiiioiiiioiiiiiiioiiiaiiiiiiiiiiiiiiiii
oiiiiiioiiiiiiiiiiiiiiiuiiiiiiiiiiiiiiiiiioiiiiiiiioiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuuuiioiiiiiuiiiiiiiiiiiiiiiiiiiiiiiioiiiiuioi
uiiiiiiioiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioaoiiiiioioiiiiiiiioooiiiiiooioiiioiiiiiouiiiiiiiiiiiiooiiiiiiiiiiiiiiiiiiiii
iiiiiiiiiioiiiiiiiiiiiiiiiiiiioiooiiiiiiiiiiioiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiiiiiiiiiiiioiiiuiiiiiiiiiioiiiiiiiiiiiiuoiii
oiiioiiiiiiiiiiiiiiiiiiiiiiuiiiiuuiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuiuiiiiiuuiiiiiiiiiiiiiiiiiiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiiioiii
iiiioiiiiiiiiiiiiiiiiiiiiioiiiiiiiiioiiiiuiiiioiiiioiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiioiiiiiiuiiiiiiiiiiiiiiiooiiiiiiiiiiiiiiiiii
iioooiiiiiiiioiiiiouiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioiiioiiiiiiioiiiiiioiiiiuiuoiiiiiiiiioiiiiiiioiiiiiiiiiiii
iiiiiiiiiiiiiiiioioi….

LF-mapping

i BWT Suffix (First char is F)
 Let F[i] = T[SA[i]]
0 a1 $
 First characters of suffixes
1 r a1$
 Fact : the order of same chars
2 d a2bra$
in BWT and F is same
3 $ a3bracadabra$
4 r a4cadabra$
5 c a5dabra$
6 a2 bra$
7 a3 bracadabra$
8 a4 cadabra$
9 a5 dabra$
10 b ra$
11 b racadabra$

LF-mapping (1/3)

0 a1
 First characters of suffixes
1 r a1
 Fact : the order of same chars
2 d a2
in BWT and F is same
3 $ a3bracadabra$

 Proof: the order of chars in BWT 4 r a4
is determined by the following 5 c a5
Same
suffixes. 6 a2
the order of chars in F is also 7 a3 bracadabra$
determined by the following 8 a4
suffixes. These are same ! 9 a5
10
11

LF-mapping (2/3)

0 a $
 First characters of sorted suffixes
1 r a$
2 d abra$
 Let C[c] be the number of
3 $ abracadabra$
characters smaller than c in T
 C[c] = beginning position of F[i] = c 4 r acadabra$
 For B[i]=c, its corresponding char 5 c adabra$
in F[i] is found by Rankc(T, i) + C[c] 6 a bra$
7 a bracadabra$
8 a cadabra$
9 a dabra$
10 b ra$
11 b racadabra$

LF-mapping (3/3)

 Let ISA[0, n) s.t. ISA[r]=p if SA[p] = r
 SA[i] converts the rank into the position, and ISA[i] vice versa.
 For c := BWT[r], we define two functions
LF(r) := Rankc(T, r) + C[c] = ISA[SA[r] -1]
PSI(r) := Selectc(T, r– C[c]) = ISA[SA[r] + 1]
 LF and PSI are inverse functions each other.
 Using LF and PSI, we can move a text in its position order.
 PSI = forward by one char , LF = backward by one char
 e.g. T = abracadabra

SA[LF[r]] SA[r] SA[PSI[r]]

Decoding of BWT using LF-mapping

r := beg // the rank of T[0, n), i.e. ISA[0]
while
Output BWT[r]
r := PSI(r)
until r != beg

That’s all !

In practice, PSI values are pre-computed using O(n) time

Compressed Full-Text Index

 Full-text can be compressed
 Moreover, we can recover original text from compressed index
(self-index)

 I explain FM-index only here.

 Other important compressed full-text indices
 Compressed Suffix Arrays
 LZ-index
 LZ77-based

FM-index P’=ra appear in P P=a appeared in [1,
5)
i BWT Suffix
 Using LF-mapping for searching
0 a $
pattern.
1 r a$
2 d abra$
 Assume that a range [sp, ep)
3 $ abracadabra$
correspond to a pattern P.
4 r acadabra$
 Then if a pattern P’ = cP occurs, 5 c adabra$
then [sp, ep) should be preceded by c 6 a bra$
7 a bracadabra$
8 a cadabra$
9 a dabra$
10 b ra$
11 b racadabra$

FM-index P’=ra appear in P P=a appeared in [1,
5)
i BWT Suffix
 Using LF-mapping for searching
0 a $
pattern.
1 r1 a$
2 d abra$
 Assume that a range [sp, ep)
3 $ abracadabra$
correspond to a pattern P.
4 r2 acadabra$
 Then if a pattern P’ = cP occurs, 5 c adabra$
then [sp, ep) is preceded by c 6 a bra$
7 a bracadabra$
 The corresponding suffixes can be 8 a cadabra$
found by Rankc(BWT, sp) + C[c] 9 a dabra$
and Rankc(BWT, ep) + C[c] 10 b r1a$
11 b r2acadabra$

FM-index

Input query : Q[0, m)
Output a range [sp, ep) corresponding to Q

sp := 0, ep := n
for i = m-1 to 0
c := Q[c]
sp := C[c] + Rankc(BWT, sp)
ep := C[c] + Rankc(BWT, ep)
end for
Return (sp, ep)

Occ(T, Q) = ep - sp
Locate(T, Q) = SA[sp, ep)

Storing Sampled SA’s
i SA R
 Store sampled SAs instead of original SA. 0 0
 Sample SA[p] s.t. SA[p] = tk for (t=0…) 1 0
 We also store a bit array R to indicate the index of 2 0
sampled SAs. 3 0 1
4 0
 Recall that LF(r) = ISA[SA[r]-1] for c = BWT[c]. 5 0
 If SA[p] is not sampled, we use LF to move on SA 6 8 1
 SA[r] = SA[LF(r)]-1 = SA[LF2(r)]-2 = … = SA[LFk(r)]-k 7 0
 We move on SA until the sampled point is found 8 4 1
using at most k LF operations. 9 0
10 0
 Set k = log1+εn. The size of A is (n/k) log n = o(n) 11 0
 The lookup of SA[i] requires O(log1+εn logσ) time.

Grammar Compression on SA [R. Gonzalez+ CPM07]

 SA can also be compressed using grammar based compressions
 Let DSA[i] := SA[i] – SA[i-1] (for i > 0)
 If BWT[i] = BWT[i+1] = … = BWT[i+run] then
DSA[i+1, i+run] = DSA[k+1, k+run] for k := LF[i]
 DSA contains many repetition if BWT has many runs.

 DSA can be compressed using grammar compression
 [R. Gonzelez+ CPM07] used Re-pair but other grammar compressions
can be used as long as they support fast decompression.

Compression Boosting in FM-index

 The size of FM-index using wavelet trees is nH0 + o(nH0) bits
 We can improve the space to nHK by compression boosting
 Compression Boosting [P. Ferragina+ ACM T 07]
 Divide BWT into blocks using contexts of length k
 Build a separate wavelet tree for each block
 Implicit Compression Boosting [Makinen+ SPIRE 07]
 Wavelet Tree + compressed bit vectors using RRR in Wavelet Tree
The size is close to optimal: the overhead is at most σk+1 log n, which is
o(n) when k ≦ ((1-ε)logσn)-1
b aaaa..
b aabb.. Divide BWT’s
a aacc.. using following
b aacd.. suffix of length
d abcd..
d abcc..

Compression Boosting using fixed length block
[Karkkaine+ SPIRE 11]
 Dividing BWT into blocks of fixed length can also achieve nHK
 Block size is σ(log n)2
 To analyze the working space, we calculate the overhead incurred of
conversion from context-sensitive partition into fixed-length partition

 Note. This encoding is very similar to the encoding used in bzip2

SSA: Huffman WT
AFFMI : Context Sensitive
Partition of BWT
Fixed Block : Fixed Block

+ RRR : Represent bit vectors
in WT using RRR encoding

Summary of state-of-the-art

 Give an input String T, compute BWT for T (=B).
 Divide B into blocks of fixed length
 Each block is stored in a Huffman-shaped wavelet tree.
 A bit-vector in a wavelet tree is stored using RRR representation.
 For English, 2-3 bits per character, 1 pattern 0.06 msec.

Document Query

 DL(D, P) : Document Listing
 List the distinct documents where P appears
 DLF(D, P) :Document Listing with Frequency
 The result of DL(D, P) + frequency of P in each document
 Top(k, D, P) : Top-k Retrieval
 List the k documents where P appears with largest scores S(D, P)
 e.g. S(D, P) : term frequency, document’s score (e.g. Page rank)

 In the previous section we consider hit positions, and here we consider hit
documents
 We need to convert hit positions into hit documents.
 This is not easy.

Index for document collection

 For a document collection D={d1, d2, …, dD}
 Build T = d1#d2#d3#...#dD and build an full-text index for T
 We also use a bit array to indicate the beginning positions of docs.
 Given a hit position in T, we can convert it to the hit doc in O(1) time.

 Baseline of DL and DLF
 Perform Locate(T, Q) and convert each hit positions into a hit doc each
independently.
 However, this requires O(occ(T, Q)) time, and occ(T, Q) could be much larger
than # of hit documents

DocArray and PrevArray (1/2)
[Muthukrishnan SODA 02]

 DocArray : D[0, n] s.t. D[i] is the doc ID of SA[i].
 DL(D, Q) is now to list all distinct values in the hit range D[sp, ep)
 PrevArray : P[0, n] s.t. P[i]=max{j < I, D[j]=D[i]}
 P[i] can be simulated as selectD[i] (D, rankD[i](D, i-1))

i 0 1 2 3 4 5 6
D 66 77 66 88 77 77 88
P -1 -1 0 -1 1 4 3

DocArray and PrevArray (2/2)

 Alg. DocList reports all distinct doc IDs in [sp, ep) in O(|DL|) time.
 Call DocList(sp, sp, ep)

DocList(gsp, sp, ep) i 0 1 2 3 4 5 6
IF sp=ep D 66 77 66 88 77 88 88
RETURN;
(min, pos) := RMQ(P, sp, ep) P -1 -1 0 -1 1 3 5
IF min ≧ gsp
RMQ(P, 2, 7) = -1 (=P[3])
RETURN; Report D[3]
Report D[pos]
DocList(gsp, sp, pos)
DocList(gsp, pos+1, ep)

RMQ(P, 2, 3) = 0 (=P[2])
RMQ(P, 5, 7) = 3 (=P[5])
Report D[2]
Return

Wavelet Tree + Grammar Compression
[Navarro+ SEA 11]

 Build a wavelet tree for D.
 Using Freqlist(k, sp, ep), we can solve DL and DLF in O(log σ) time.
 However storing D requires n lg |D| bits. This is very large.

 As in compression of suffix arrays using grammar compression, D can
also be compressed by grammar compression
 Apply a grammar compression to bit arrays in wavelet trees, and support
rank/select operations

 Currently, this is the fastest/smallest approach in practice.

Suffix Tree based approach (1/4)
[W.K. Hon+ FOCS 09]
 Let ST be the generalized suffix tree for a collection of documents
 A leaf l is marked with document d if l corresponds to d
 An internal node l is marked with d if at least two children of v contain
leaves marked with d
 An internal node can be marked with many values d.

1 c

2 a b c

3 7 a 10 c

4 5 6 8 9 11 12 13 14
a c b a a c c b c

Suffix Tree based approach (2/4)

 In every node v marked with d, we store a pointer ptr(v, d) to its lowest
ancestor u such that u is also marked with d
 ptr(v, d) also has a weight corresponds to relevance score.
 If such ancestor does not exist, ptr(v, d) points to a super root node.

1 c
ptr(7, a) = 2
2 a b c
ptr(13, a) = 2

3 7 a 10 c

4 5 6 8 9 11 12 13 14
a c b a a c c b c

Suffix Tree Based Approach (3/4)

 Lemma 1. The total number of pointers ptr(*,*) in T is at most 2n
 Lemma 2. Assume that document d contains a pattern P and v is the node
corresponding to P. Then there exist a unique pointer ptr(u, d), such that u
is in the subtree of v and ptr(u, d) points to an ancestor of v

1 c
Example: when v = 10. ptr(10, c)=2,
2 a b c and ptr(13, b)=2.
These ptrs correspond to
c the occurrence of P in b and c.
10

11 12 13
c c b

Suffix Tree Based Approach (4/4)

 To enumerate document corresponding to the node x, find all ptrs that
begins in the subtree of x, and points to the ancestor of x.
 These ptrs uniquely corresponds to each hit documents (Lemma)

x

Geometric Interpretation
pd

pd(v, c) := -1 c a b
depth of ptr(v, c) 0 c c
1 a c b a c b
place all ptrs in 2 a a c c
a point (x, pd(v,c))
1 2 2 2 4 5 6 7 8 9 10 11 12 13 14

depth
1 c 0

2 a b c 1

3 7 a 10 c 2

4 5 6 8 9 11 12 13 14 3

a c b a a c c b c

Document Query = three-sided 2D query

 When v := locus(Q) d := depth(v), and r be the maximal index assigned to
v or its descendants. DL can be solve by
RangeList(v, r, 0, d-1)

pd
-1 v=2
c a b v=10
0 c c
1 a c b a c b
2 a a c c
1 2 2 2 4 5 6 7 8 9 10 11 12 13 14

Top-K query

 Each ptr(v, d) can store any score function depended on Q, D.
 For each node v, f ptr(x, d) points to vTop-K

 Theorem [Navarro+ SODA 12] k most highly weighted points in the range
[a, b] x [0, h] can be reported in decreasing order of their weights in O(h+k)
time

 In practice, we can use a priority queue on each node.

Bibliography

[Navarro+ SEA 12] Practical Compressed Document Retrieval, G. Navarro, S.
J. Puglisi, and D. Valenzuela, SEA 2012
[Navarro+ SODA 12] Top-k Document Retrieval in Optimal Time and Linear
Space, SODA 2012
[T. Gagie+ TCS+ 11] New Algorithms on Wavelet Trees and Applications to
Information Retrieval, TCS 2011
[G. Navarro+ 11] Practical Top-K Document Retrieval in Reduced Space
[Karkkaine+ SPIRE 11] Fixed Block Compression Boosting in FM-indexes, J.
Karkkainen, S. J. Puglisi, SPIRE 2011
[W.K. Hon+ FOCS 09] Space-Efficient Framework for Top-k String Retrieval
Problems, W. K. Hon, R. Shah, J. S. Vitter, FOCS 2009

[Navarro+ SEA 11] “Practical Compressed Document Retrieval”, Gonzalo
Navarro , Simon J. Puglisi, and Daniel Valenzuela,, SEA 2011
[Navarro+ SODA 12] Top-k Document Retrieval in Optimal Time and Linear
Space , Gonzalo Navarro, Yakov Nekrich, SODA 2012
[Gagie+ JTCS 11] New Algorithms on Wavelet Trees and Applications to
Information Retrieval, Travis Gagie , Gonzalo Navarro , Simon J. Puglisi, J. of
TCS 2011

Succinct Data Structure for Analyzing Document Collection

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Succinct Data Structure for Analyzing Document Collection

Similaire à Succinct Data Structure for Analyzing Document Collection (20)

Plus de Preferred Networks

Plus de Preferred Networks (20)

Dernier

Dernier (20)

Succinct Data Structure for Analyzing Document Collection