Gwt sdm public
1. 2011 SIAM International Conference on Data Mining,
Apr. 28, 2011
Kernel-based similarity search
in massive graph databases
with wavelet trees
Yasuo Tabei and Koji Tsuda
JST ERATO Minato Project, Japan
AIST Comp Bio Research Center, Japan
2. Graph similarity search
n Similarity search for 25 million molecular graphs
• Find all graphs whose similarity to the query is ≥ 1 − ε
• Similarity = Weisfeiler-Lehman graph kernel (NIPS,2009)
n Use a data structure called the “Wavelet Tree” (SODA, 2003)
• Self-index of an integer array
• Combination of rank dictionaries of bit arrays
• Enables fast array operations
• e.g., range minimum query, range intersection
n First review the wavelet tree on the range intersection
problem, then apply it to graph similarity search
11/05/12 2
3. Range intersection on array
n Array A of length N, 1 ≤ Ai ≤ M
A = 1 3 6 8 2 5 7 1 2 7 4 5
(Figure: positions i, j, and k marked on the array.)
n Range intersection: rint(A,[i,j],[k,ℓ])
• Find the set intersection of A[i,j] and A[k,ℓ]
• The naïve solution is to concatenate and sort
n Use an index structure of the array (Wavelet
Tree) and solve the problem faster!
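For reference, the naïve baseline (concatenate and sort) can be sketched as follows. This is only an illustration: the function name rint_naive and the 1-based inclusive indexing are assumptions, and the upper end of the second range is written l.

```python
from itertools import groupby


def rint_naive(A, i, j, k, l):
    """Values present in both A[i..j] and A[k..l] (1-based, inclusive)."""
    # Tag each element with the range it came from, then concatenate and sort.
    tagged = [(v, 0) for v in A[i - 1:j]] + [(v, 1) for v in A[k - 1:l]]
    tagged.sort()
    result = []
    for value, group in groupby(tagged, key=lambda t: t[0]):
        # Keep a value only if it was seen in both ranges.
        if {source for _, source in group} == {0, 1}:
            result.append(value)
    return result


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
# A[1..5] = 1 3 6 8 2 and A[8..12] = 1 2 7 4 5 share the values 1 and 2.
print(rint_naive(A, 1, 5, 8, 12))
```

The cost is dominated by sorting both ranges, which is what the index structure is meant to avoid.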
5. Record whether each element is in the
lower half (0) or the upper half (1)
[1,8] 0 0 1 1 0 1 1 0 1 0 0 1
[1,4] 0 1 0 0 0 1 | [5,8] 0 1 0 1 1 0
[1,2] 0 1 0 1 | [3,4] 0 1 | [5,6] 1 0 0 | [7,8] 1 0 0
Leaves: 1 2 3 4 5 6 7 8
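The bit rows above can be reproduced with a short recursive construction. This is a sketch only: the function name build_nodes and the dictionary layout are assumptions, and the input is the 12-element array 1 3 6 8 2 5 7 1 7 2 4 5 (the Aroot of the interval-division slide), which is the array these bit rows correspond to.

```python
def build_nodes(A, lo, hi):
    """Return {(lo, hi): bits} for every internal wavelet tree node."""
    nodes = {}

    def build(values, lo, hi):
        if lo == hi or not values:
            return
        mid = (lo + hi) // 2
        # 0 = value in the lower half [lo, mid], 1 = upper half [mid+1, hi].
        nodes[(lo, hi)] = [0 if v <= mid else 1 for v in values]
        build([v for v in values if v <= mid], lo, mid)
        build([v for v in values if v > mid], mid + 1, hi)

    build(A, lo, hi)
    return nodes


A_root = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
nodes = build_nodes(A_root, 1, 8)
for value_range in sorted(nodes, key=lambda r: (r[0] - r[1], r[0])):
    print(value_range, nodes[value_range])
```

Each child keeps the elements of its parent in their original order, which is what makes rank-based navigation between levels possible.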
6. Index each bit array with a rank
dictionary
n With rank dictionary, the rank operation can be
done in O(1) time
• rank_c(B,i): return the number of occurrences of c ∈ {0,1} in B[1…i]
n Several implementations are known, e.g., rank9sel
Example) B=0110011100
i 1 2 3 4 5 6 7 8 9 10
B 0 1 1 0 0 1 1 1 0 0
rank1(B,8)=5 (ones in 01100111)
rank0(B,5)=3 (zeros in 01100)
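The definition of rank can be checked against the example with a linear-time reference function. The helper name rank is an assumption, and this version scans the prefix, so it is not the O(1) operation a rank dictionary provides.

```python
def rank(B, c, i):
    """Number of occurrences of bit c in B[1..i] (1-based, inclusive)."""
    return B[:i].count(c)


B = [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
print(rank(B, 1, 8), rank(B, 0, 5))  # rank1(B,8) and rank0(B,5)
```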
7. Implementation of rank dictionary
n Divide the bit array B into large blocks of length l = log² n
• RL = ranks of large blocks
n Divide each large block into small blocks of length s = (log n)/2
• RS = ranks of small blocks relative to the large block
rank1(B,i) = RL[⌊i/l⌋] + RS[⌊i/s⌋] + (remaining rank)
Time: O(1)
Memory: n + o(n) bits
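A minimal sketch of the two-level layout, under stated assumptions: the class name RankDict is hypothetical, block lengths are rounded so that l is a multiple of s, and the "remaining rank" is computed by summing fewer than s bits where a real implementation would use a lookup table or a popcount instruction.

```python
import math


class RankDict:
    """Two-level rank dictionary over a list of 0/1 ints (sketch)."""

    def __init__(self, bits):
        self.bits = list(bits)
        n = len(self.bits)
        log_n = max(2, math.ceil(math.log2(max(n, 2))))
        self.s = log_n // 2              # small block length, about (log n)/2
        self.l = self.s * 2 * log_n      # large block length, about log^2 n
        self.RL, self.RS = [], []
        ones = 0                         # ones in bits[0:pos]
        large_base = 0                   # ones before the current large block
        for pos in range(n + 1):
            if pos % self.l == 0:
                self.RL.append(ones)
                large_base = ones
            if pos % self.s == 0:
                self.RS.append(ones - large_base)
            if pos < n:
                ones += self.bits[pos]

    def rank1(self, i):
        """Number of ones in B[1..i], i.e. in bits[0:i]."""
        block = i // self.s
        remaining = sum(self.bits[block * self.s:i])  # fewer than s bits
        return self.RL[i // self.l] + self.RS[block] + remaining

    def rank0(self, i):
        """Number of zeros in B[1..i]."""
        return i - self.rank1(i)


rd = RankDict([0, 1, 1, 0, 0, 1, 1, 1, 0, 0])
print(rd.rank1(8), rd.rank0(5))
```

Because RS stores counts relative to the enclosing large block, each RS entry needs only O(log log n) bits, which is where the o(n) overhead comes from.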
8. O(1)-time division of an interval
n Using the rank operations, the division of an
interval can be done in constant time
• rank0 for left child and rank1 for right child
n The naïve approach takes time linear in the total number of elements
[1,8] Aroot 1 3 6 8 2 5 7 1 7 2 4 5
rank0 → [1,4] Aleft 1 3 2 1 2 4
rank1 → [5,8] Aright 6 8 5 7 7 5
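The interval division can be sketched with two rank calls per endpoint. The function name split_interval is an assumption, and rank is computed here by a linear scan as a stand-in for the O(1) rank dictionary.

```python
def split_interval(bits, i, j):
    """Map positions [i, j] (1-based, inclusive) to both child arrays."""

    def rank1(p):
        return sum(bits[:p])      # stand-in for an O(1) rank dictionary

    def rank0(p):
        return p - rank1(p)

    left = (rank0(i - 1) + 1, rank0(j))    # interval in the left child
    right = (rank1(i - 1) + 1, rank1(j))   # interval in the right child
    return left, right


A_root = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
bits = [0 if v <= 4 else 1 for v in A_root]
# Aroot[3..8] = 6 8 2 5 7 1 maps to Aleft[3..4] = 2 1 and Aright[1..4] = 6 8 5 7.
print(split_interval(bits, 3, 8))
```

A child interval whose start exceeds its end is empty, which is the signal to stop descending into that child.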
10. Graph Similarity Search
n Bag-of-words representation of graph
• Weisfeiler-Lehman procedure (NIPS, 2009), Hido and
Kashima (2009), Wang et al. (2009)
W=(A,D,E,H)
n Cosine similarity query
• Find all graphs W whose cosine similarity
(kernel) to the query Q is at least 1-ε
11. Weisfeiler-Lehman Procedure (NIPS,09)
n Convert a graph into a set of words (bag-of-words)
i) Make a label set of adjacent vertices ex) {E,A,D}
ii) Sort ex) A,D,E
iii) Add the vertex label as a prefix ex) B,A,D,E
iv) Map the label sequence to a unique value ex) B,A,D,E → R
v) Assign the value as the new vertex label
Bag-of-words: {A,B,D,E,…,R,…}
(Figure: a graph with vertex labels A, B, D, E; the vertex labeled B is relabeled R.)
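Steps i) to v) can be sketched as one relabeling round. Everything named here is an assumption for illustration: the function wl_iteration, the dictionary-based graph, and the generated labels w0, w1, … (the slide uses R for the new label). The example graph is a single vertex B with neighbours E, A, D.

```python
from collections import Counter


def wl_iteration(labels, adjacency, table):
    """One relabeling round; `table` maps label sequences to new labels."""
    new_labels = {}
    for v, label in labels.items():
        neighbours = sorted(labels[u] for u in adjacency[v])   # steps i, ii
        sequence = (label, *neighbours)                        # step iii
        if sequence not in table:                              # step iv
            table[sequence] = f"w{len(table)}"
        new_labels[v] = table[sequence]                        # step v
    return new_labels


labels = {0: "B", 1: "E", 2: "A", 3: "D"}
adjacency = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
table = {}
new_labels = wl_iteration(labels, adjacency, table)
# The bag-of-words collects the original labels and the new ones.
bag = Counter(labels.values()) + Counter(new_labels.values())
print(table[("B", "A", "D", "E")], sorted(bag))
```

Sharing one table across all graphs in the database ensures that identical neighbourhoods receive the same word everywhere.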
12. Semi-conjunctive query
n Cosine similarity query can be relaxed to the
following form
W s.t |W ∩ Q| ≥ k
n Find all graphs W which share at least k words with the
query Q
n No false negatives
n False positives can easily be filtered out by cosine
calculations
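A small sketch of the two-stage idea: filter by shared-word count, then remove false positives by computing the cosine. The slides do not state how k is chosen; here k = ⌈(1−ε)²|Q|⌉ is an assumption, which follows from K_N ≥ 1−ε together with |W ∩ Q| ≤ |W|. Bags are treated as sets, and all names and the toy database are hypothetical.

```python
import math


def cosine(W, Q):
    """Cosine similarity of two word sets."""
    if not W or not Q:
        return 0.0
    return len(W & Q) / math.sqrt(len(W) * len(Q))


def semi_conjunctive_search(database, Q, eps):
    """Return (candidates, hits) for the threshold 1 - eps."""
    # Assumed choice of k; the small slack guards against rounding up
    # when (1 - eps)^2 * |Q| is an integer.
    k = math.ceil((1 - eps) ** 2 * len(Q) - 1e-9)
    candidates = [g for g, W in database.items() if len(W & Q) >= k]
    hits = [g for g in candidates if cosine(database[g], Q) >= 1 - eps]
    return candidates, hits


database = {
    "g1": {"A", "B", "D", "E"},
    "g2": {"A", "B", "D", "H"},
    "g3": {"A", "B", "R", "S", "T", "U", "V", "X", "Y"},
    "g4": {"H", "R"},
}
Q = {"A", "B", "D", "E"}
candidates, hits = semi_conjunctive_search(database, Q, eps=0.3)
print(candidates, hits)
```

In this toy database g3 shares enough words to become a candidate but is too large to reach the cosine threshold, so it is a false positive that the second stage removes.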
13. Inverted index, Array, Wavelet Tree
n Graph database is represented as an inverted index
n Concatenate all rows to make an array
n Index the array with a wavelet tree
Aroot 1 3 6 8 2 5 7 1 2 7 4 5
(Figure: wavelet tree built on Aroot.)
n Semi-conjunctive query = extension of range intersection
• Find graph ids which appear at least k times in the array
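The extension of range intersection can be sketched as a top-down traversal that carries one interval per query word and prunes a node as soon as fewer than k intervals are non-empty. This is an illustration of the idea, not the gWT implementation: the names build, search, and query, the toy inverted index, and the prefix-sum lists standing in for rank dictionaries are all assumptions. It also assumes k ≥ 1 and that each row lists a graph id at most once.

```python
def build(values, lo, hi):
    """Wavelet tree node over graph ids in [lo, hi]; leaves are None."""
    if lo == hi:
        return None
    mid = (lo + hi) // 2
    ones = [0]                      # ones[p] = rank1 of the first p bits
    for v in values:
        ones.append(ones[-1] + (1 if v > mid else 0))
    return {
        "ones": ones,
        "left": build([v for v in values if v <= mid], lo, mid),
        "right": build([v for v in values if v > mid], mid + 1, hi),
    }


def search(node, lo, hi, intervals, k, out):
    """Collect ids covered by at least k of the half-open intervals."""
    intervals = [(b, e) for b, e in intervals if b < e]
    if len(intervals) < k:          # pruning: too few rows can still match
        return
    if lo == hi:
        out.append(lo)
        return
    mid = (lo + hi) // 2
    ones = node["ones"]
    left = [(b - ones[b], e - ones[e]) for b, e in intervals]   # rank0
    right = [(ones[b], ones[e]) for b, e in intervals]          # rank1
    search(node["left"], lo, mid, left, k, out)
    search(node["right"], mid + 1, hi, right, k, out)


# Hypothetical inverted index: word -> ids of graphs containing it.
index = {"A": [1, 2, 3], "B": [1, 2, 3], "D": [1, 2],
         "E": [1], "H": [2, 4], "R": [3, 4]}
array, rows = [], {}
for word, ids in index.items():
    rows[word] = (len(array), len(array) + len(ids))
    array.extend(ids)
root = build(array, 1, 4)


def query(words, k):
    out = []
    search(root, 1, 4, [rows[w] for w in words if w in rows], k, out)
    return out


print(query(["A", "B", "D", "E"], 2))
```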
14. Pruning condition
n Find all graphs W in the database whose cosine similarity
to the query Q is at least the threshold 1-ε
\[ W \ \text{s.t.}\ K_N(W,Q) = \frac{|W \cap Q|}{\sqrt{|W|\,|Q|}} \ge 1 - \epsilon \]
• W, Q: bag-of-words of graphs
n The above solution can be relaxed as follows,
\[ \text{If } K_N(Q,W) \ge 1 - \epsilon, \text{ then } (1-\epsilon)^2\,|Q| \le |W| \le \frac{|Q|}{(1-\epsilon)^2} \]
• Can be used for pruning
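As a numeric illustration of the size bound, the following sketch computes the range of |W| that can still satisfy the cosine threshold. The function name size_bounds and the example values |Q| = 20, ε = 0.3 are assumptions.

```python
def size_bounds(q_size, eps):
    """Range of |W| that can still reach cosine >= 1 - eps with the query."""
    lower = (1 - eps) ** 2 * q_size
    upper = q_size / (1 - eps) ** 2
    return lower, upper


lower, upper = size_bounds(20, 0.3)
print(round(lower, 2), round(upper, 2))
```

With these values, only graphs with between 10 and 40 words need to be considered; every other graph can be skipped without computing its cosine.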
15. Complexity
n Time per query: O(τm)
• τ: the number of traversed nodes
• m: the number of words in the query's bag-of-words
n Memory: (1+α)N log n + M log N
• N: the number of all words in the database
• M: Maximum integer in the array
• n: the number of graphs
• α: overhead for rank dictionary (α=0.6)
n Inverted index takes N log n bits
n About 60% overhead relative to the inverted index!
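The memory formula can be evaluated for a rough sense of scale. Only n = 25 million and α = 0.6 come from the slides; the value of N, the choice M = n, and the base-2 logarithm are assumptions made purely for illustration.

```python
import math


def gwt_bits(N, n, M, alpha=0.6):
    """(1 + alpha) N log n + M log N, with base-2 logarithms (assumed)."""
    return (1 + alpha) * N * math.log2(n) + M * math.log2(N)


def inverted_index_bits(N, n):
    """N log n bits for the plain inverted index."""
    return N * math.log2(n)


n = 25_000_000          # number of graphs (from the slides)
N = 1_000_000_000       # hypothetical total number of words
M = n                   # assumption: the array stores graph ids
ratio = gwt_bits(N, n, M) / inverted_index_bits(N, n)
print(f"{gwt_bits(N, n, M) / 8 / 2**30:.1f} GiB, ratio {ratio:.3f}")
```

When M log N is small next to N log n, the ratio stays close to 1 + α, which matches the stated overhead of about 60%.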
16. Experiments
n 25 million chemical compounds from PubChem
database
n Evaluated search time and memory usage
n Cosine thresholds ε=0.3,0.35,0.4
n Compare our method gWT to
• Inverted index (concatenate all intervals and sort)
• Sequential scan (compute kernel one by one)
20. Related work
n A lot of methods have been proposed so far.
1. GraphGrep [Shasha et al., 02]
2. gIndex [Yan et al., 04]
3. Tree+Delta [Zhao et al., 07]
4. TreePi [Zhang et al., 07]
5. GString [Jiang et al., 07]
6. FG-Index [Cheng et al., 07]
7. GDIndex [Williams et al., 07]
etc.
21. Related work
n Decompose graphs into a set of substructures
• graphs, trees, paths, etc.
n Build a substructure-based index
n Drawbacks
• Need substructure mining
• Do not scale to millions of graphs
(Figure: graphs are decomposed into substructures, which are then indexed.)
22. Summary
n Efficient similarity search method for massive
graph databases
n Solve semi-conjunctive queries efficiently
n Our method is built on wavelet trees
n Use Weisfeiler-Lehman procedure to convert
graphs into bag-of-words
n Applicable to 25 million graphs
n Software
http://code.google.com/p/gwt/