Gwt presen alsip-20111201
1. ALSIP, Dec. 1 2011
Kernel-based similarity search
in massive graph databases
with wavelet trees
Yasuo Tabei and Koji Tsuda
JST ERATO Minato Project,
National Institute of Advanced Industrial
Science and Technology
2. Outline
• Overview
• Wavelet Tree
✓ Problem = Range intersection on array
• Graph similarity search
✓ Weisfeiler-Lehman kernel
✓ Apply wavelet tree
• Experiments
✓ Comparison to inverted index
✓ 25 million molecular graphs
3. Graph similarity search
• Similarity search for 25 million molecular
graphs
✓ Find all graphs whose similarity to the query is at least 1 - ε
✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009)
• Use a data structure called “Wavelet Tree” (SODA, 2003)
✓ Self-index of an integer array
✓ Enable fast array operations
‣ e.g., range minimum query, range intersection
4. Range intersection on array
• Array A of length N, 1 ≤ A_i ≤ M
A = 1 3 6 8 2 5 7 1 2 7 4 5 (in the example: i = 3, j = 5, k = 9, ℓ = 10)
• Range intersection: rint(A, [i,j], [k,ℓ])
✓ Find common elements of A[i,j] and A[k,ℓ]
• The naive method is to concatenate and sort
Ex) concatenate: 6,8,2,2,7 ⇛ sort: 2,2,6,7,8
• Use wavelet tree and solve the problem faster
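The naive concatenate-and-sort method can be sketched as follows. This is only an illustration: the function name `rint_naive` and the 1-based inclusive positions are assumptions, and the interval positions (3, 5, 9, 10) are read off the example values on the slide.

```python
def rint_naive(A, i, j, k, l):
    """Common elements of A[i..j] and A[k..l] (1-based, inclusive positions)."""
    # Tag each element with the range it came from, then concatenate and sort.
    tagged = [(x, 0) for x in A[i - 1:j]] + [(x, 1) for x in A[k - 1:l]]
    tagged.sort()
    common = []
    pos = 0
    while pos < len(tagged):
        value = tagged[pos][0]
        sources = set()
        # After sorting, equal values are adjacent, so one scan finds them.
        while pos < len(tagged) and tagged[pos][0] == value:
            sources.add(tagged[pos][1])
            pos += 1
        if len(sources) == 2:  # the value occurs in both ranges
            common.append(value)
    return common


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
print(rint_naive(A, 3, 5, 9, 10))  # [2]
```

The cost is dominated by the sort over all elements of both ranges, which is what the wavelet tree is meant to avoid.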
6. Remember if each element is either
in lower half(0) or higher half(1)
[1,8]  0 0 1 1 0 1 1 0 1 0 0 1
[1,4]  0 1 0 0 0 1        [5,8]  0 1 0 1 1 0
[1,2]  0 1 0 1   [3,4]  0 1   [5,6]  1 0 0   [7,8]  1 0 0
Leaves: 1 2 3 4 5 6 7 8
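A minimal sketch of how these bit arrays could be produced, assuming a plain recursive construction with nested dictionaries (the function name and the dictionary layout are illustrative, not the authors' implementation). The input is the array labeled Aroot on slide 8, whose bits match the figure.

```python
def build_wavelet_tree(values, lo, hi):
    """Return a nested dict describing the node for the value range [lo, hi]."""
    if lo == hi or not values:
        return None  # a leaf (single value) or an empty node stores no bits
    mid = (lo + hi) // 2
    # 0 = element falls in the lower half [lo, mid], 1 = upper half [mid+1, hi]
    bits = [0 if v <= mid else 1 for v in values]
    return {
        "range": (lo, hi),
        "bits": bits,
        # Each child receives its elements in their original relative order.
        "left": build_wavelet_tree([v for v in values if v <= mid], lo, mid),
        "right": build_wavelet_tree([v for v in values if v > mid], mid + 1, hi),
    }


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
root = build_wavelet_tree(A, 1, 8)
print(root["bits"])           # [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(root["left"]["bits"])   # [0, 1, 0, 0, 0, 1]
print(root["right"]["bits"])  # [0, 1, 0, 1, 1, 0]
```

Only the bit arrays need to be kept; the original values can be recovered by walking from the root to a leaf.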
7. Index each bit array
with a rank dictionary
• Using rank dictionary, the rank operation can be
performed in O(1) time
✓ rank_c(B, i): return the number of occurrences of c ∈ {0, 1} in B[1,i]
• Several methods known: rank9sel (Vigna, 08)
• Example) B = 0110011100 (positions i = 1, ..., 10)
✓ rank_1(B, 8) = 5
✓ rank_0(B, 5) = 3
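To make the rank operation concrete, here is a simplified block-based rank dictionary. It is not rank9sel: the class name, the block size of 4, and the in-block scan are assumptions made for illustration. A real implementation would use machine words and popcount so that the in-block step is constant time.

```python
class RankDictionary:
    """Toy rank dictionary: cumulative 1-counts stored at block boundaries."""

    def __init__(self, bits, block_size=4):
        self.bits = list(bits)
        self.block_size = block_size
        # block_ones[b] = number of 1s in the first b complete blocks
        self.block_ones = [0]
        for start in range(0, len(self.bits), block_size):
            ones_in_block = sum(self.bits[start:start + block_size])
            self.block_ones.append(self.block_ones[-1] + ones_in_block)

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based, inclusive); rank1(0) is 0."""
        block, offset = divmod(i, self.block_size)
        start = block * self.block_size
        # Stored count for the full blocks plus a short scan inside one block.
        return self.block_ones[block] + sum(self.bits[start:start + offset])

    def rank0(self, i):
        """Number of 0s in B[1..i]."""
        return i - self.rank1(i)


B = RankDictionary([0, 1, 1, 0, 0, 1, 1, 1, 0, 0])
print(B.rank1(8))  # 5
print(B.rank0(5))  # 3
```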
8. O(1)-division of an interval
• Using the rank operation, the division of an interval can be done in constant time
✓ rank0 for left child and rank1 for right child
• Naive = time linear in the total number of elements
[1,8]  Aroot = 1 3 6 8 2 5 7 1 7 2 4 5
rank0 ⇛ [1,4]  Aleft = 1 3 2 1 2 4
rank1 ⇛ [5,8]  Aright = 6 8 5 7 7 5
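The interval division can be sketched as below, assuming 1-based inclusive intervals. The helper name `split_interval` is hypothetical, and the prefix counts are recomputed on every call only to keep the example short; in the real structure the rank dictionary is built once and each lookup is O(1).

```python
from itertools import accumulate


def split_interval(bits, s, e):
    """Map the interval [s, e] of a node onto its left and right children."""
    ones = [0] + list(accumulate(bits))  # ones[i] = rank1(bits, i)

    def rank1(i):
        return ones[i]

    def rank0(i):
        return i - ones[i]

    left = (rank0(s - 1) + 1, rank0(e))   # positions in the left child
    right = (rank1(s - 1) + 1, rank1(e))  # positions in the right child
    return left, right  # an interval with start > end is empty


# Root bits for Aroot = 1 3 6 8 2 5 7 1 7 2 4 5 with value range [1,8]
root_bits = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(split_interval(root_bits, 3, 5))  # ((3, 3), (1, 2))
```

Positions 3 to 5 of Aroot hold 6, 8, 2. The left interval (3, 3) points at the 2 in Aleft, and the right interval (1, 2) points at 6, 8 in Aright.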
11. Graph Similarity Search
• Bag-of-words representation of graph
✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and
Kashima (ICDM, 2009), Wang et al., (EDBT, 2009)
Example: W = (A, D, E, H)
• Cosine similarity query
✓ Find all graphs W whose cosine similarity (kernel) to
the query Q is at least 1 - ε
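As a rough illustration of how a graph can become a bag-of-words, the sketch below runs a Weisfeiler-Lehman style relabeling: each node's new label is its own label joined with the sorted labels of its neighbours. The slides do not give these details, so the function name, the string-based labels, and the choice to keep only the labels of the final iteration as a set are all assumptions; practical implementations typically compress labels into integers.

```python
def wl_bag_of_words(adjacency, labels, iterations=1):
    """adjacency: node -> list of neighbours; labels: node -> string label."""
    current = dict(labels)
    for _ in range(iterations):
        updated = {}
        for node, neighbours in adjacency.items():
            # Sorting makes the new label independent of neighbour order.
            neighbour_labels = sorted(current[n] for n in neighbours)
            updated[node] = current[node] + "(" + ",".join(neighbour_labels) + ")"
        current = updated
    return set(current.values())


# Toy molecule-like path graph C - C - O (hypothetical example data)
adjacency = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
words = wl_bag_of_words(adjacency, labels)
print(sorted(words))  # ['C(C)', 'C(C,O)', 'O(C)']
```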
13. Semi-conjunctive query
• Cosine similarity query can be relaxed to
the following form
W s.t. |W ∩ Q| ≥ k
✓ Find all graphs W which share at least k words
with the query Q
• No false negatives
• False positives can easily be filtered out by
cosine calculations
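The two-stage idea (semi-conjunctive query, then cosine filtering) can be sketched by brute force as follows. The function names and toy database are hypothetical, bags are treated as sets, and the choice k = (1 - ε)² |Q| is an assumption: the slide does not state k, but this value follows from the cosine condition and guarantees no false negatives. The small tolerance guards against floating-point rounding.

```python
from math import sqrt


def cosine(W, Q):
    """Cosine similarity between two bags-of-words represented as sets."""
    return len(W & Q) / sqrt(len(W) * len(Q))


def semi_conjunctive(database, Q, k):
    """Ids of all graphs sharing at least k words with the query."""
    return [gid for gid, W in database.items() if len(W & Q) >= k]


def similarity_search(database, Q, eps):
    # Assumed threshold: cosine >= 1 - eps implies |W & Q| >= (1 - eps)^2 |Q|
    k = (1 - eps) ** 2 * len(Q) - 1e-9
    candidates = semi_conjunctive(database, Q, k)
    # Exact cosine computation removes the false positives.
    return [gid for gid in candidates if cosine(database[gid], Q) >= 1 - eps]


database = {
    1: {"A", "D", "E", "H"},
    2: {"A", "D", "E", "H", "P", "R", "S", "X", "Y", "Z"},
    3: {"B", "C"},
    4: {"A", "D", "E"},
}
Q = {"A", "D", "E", "H"}
print(semi_conjunctive(database, Q, 1.96))  # [1, 2, 4]
print(similarity_search(database, Q, 0.3))  # [1, 4]
```

Graph 2 shares all four query words but is much larger than Q, so it passes the semi-conjunctive query and is then rejected by the cosine check.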
14. Inverted index, Array, Wavelet Tree
• Inverted index is built from the graph database
• Concatenate all rows to make an array
• Index the array with wavelet tree
Aroot 1 3 6 8 2 5 7 1 2 7 4 5
• Semi-conjunctive query = Extension of range intersection
✓ Find graph ids which appear at least k times in given intervals
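A compact sketch of this extended range intersection, under stated assumptions: the `Node` class stores prefix counts of 1s instead of a succinct rank dictionary, intervals are 1-based and inclusive, and a subtree is pruned when the total length of its intervals drops below k (a safe bound, since no id below that node can then occur k times). It illustrates the traversal idea and is not the gWT implementation.

```python
from itertools import accumulate


class Node:
    """Wavelet tree node over the value range [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not values:
            return  # leaf or empty node: nothing to store
        mid = (lo + hi) // 2
        bits = [0 if v <= mid else 1 for v in values]
        self.ones = [0] + list(accumulate(bits))  # ones[i] = rank1(bits, i)
        self.left = Node([v for v in values if v <= mid], lo, mid)
        self.right = Node([v for v in values if v > mid], mid + 1, hi)


def ids_with_min_count(node, intervals, k, out):
    """Append every id occurring at least k (>= 1) times in the intervals."""
    intervals = [(s, e) for s, e in intervals if s <= e]  # drop empty ones
    total = sum(e - s + 1 for s, e in intervals)
    if total < k:
        return  # prune: too few elements remain under this node
    if node.lo == node.hi:
        out.append(node.lo)  # every remaining element equals this id
        return
    left, right = [], []
    for s, e in intervals:
        ones_before, ones_upto = node.ones[s - 1], node.ones[e]
        left.append((s - ones_before, e - ones_upto))  # via rank0
        right.append((ones_before + 1, ones_upto))     # via rank1
    ids_with_min_count(node.left, left, k, out)
    ids_with_min_count(node.right, right, k, out)


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
root = Node(A, 1, 8)
found = []
ids_with_min_count(root, [(3, 5), (9, 10)], 2, found)
print(found)  # [2]
```

With k = 2 and the intervals A[3,5] = 6,8,2 and A[9,10] = 2,7, only id 2 is reported. With k = 1 the same call would report every id in the two intervals.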
15. Pruning search space
• Find all graphs W in the database whose cosine similarity
to a query Q is at least a threshold 1 - ε
$$W \ \text{s.t.}\ K_N(W, Q) = \frac{|W \cap Q|}{\sqrt{|W|\,|Q|}} \ge 1 - \epsilon$$
✓ W, Q: bag-of-words of graphs
• The above condition can be relaxed as follows
If $K_N(W, Q) \ge 1 - \epsilon$, then
$$(1 - \epsilon)^2\,|Q| \le |W| \le \frac{|Q|}{(1 - \epsilon)^2}$$
✓ Can be used for pruning search space
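A short derivation of the size bound, written here for clarity (the slide states only the result). It uses the cosine kernel $K_N(W, Q) = |W \cap Q| / \sqrt{|W|\,|Q|}$ and the fact that the intersection cannot be larger than either bag.

```latex
\begin{align*}
1 - \epsilon \;\le\; K_N(W, Q)
  &= \frac{|W \cap Q|}{\sqrt{|W|\,|Q|}}
  \;\le\; \frac{\min(|W|, |Q|)}{\sqrt{|W|\,|Q|}}
  = \sqrt{\frac{\min(|W|, |Q|)}{\max(|W|, |Q|)}} \\
\Longrightarrow\quad
(1 - \epsilon)^2 &\le \frac{\min(|W|, |Q|)}{\max(|W|, |Q|)} \\
\Longrightarrow\quad
(1 - \epsilon)^2\,|Q| &\le |W| \le \frac{|Q|}{(1 - \epsilon)^2}
\end{align*}
```

The lower bound covers the case $|W| \le |Q|$ and the upper bound the case $|W| \ge |Q|$, so graphs whose size falls outside this window can be skipped without computing the kernel.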
16. Complexity
• Time per query: O(τm)
• τ: the number of traversed nodes
• m: the number of words in the query's bag-of-words
• Memory: (1+α) N log n + M log N bits
• N: the number of all words in the database
• M: Maximum integer in the array
• n: the number of graphs
• α: overhead for rank dictionary (α=0.6)
• Inverted index takes N log n bits
• About 60% overhead compared to the inverted index!
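The memory formulas can be evaluated directly. The sketch below assumes base-2 logarithms (since the result is in bits); the function names and the tiny input values are made up for illustration and are not figures from the experiments.

```python
from math import log2


def gwt_bits(N, n, M, alpha=0.6):
    """Memory of the wavelet tree index: (1 + alpha) N log n + M log N bits."""
    return (1 + alpha) * N * log2(n) + M * log2(N)


def inverted_index_bits(N, n):
    """Memory of the plain inverted index: N log n bits."""
    return N * log2(n)


# Toy numbers only, chosen so the logarithms are exact
print(inverted_index_bits(8, 4))      # 16.0
print(gwt_bits(8, 4, 2, alpha=0.5))   # 30.0
```

When M log N is small relative to N log n, the ratio of the two quantities approaches 1 + α, which is where the roughly 60% overhead comes from.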
18. Experiments
• 25 million chemical compounds from PubChem
database
• Evaluate search time and memory usage
• Cosine threshold ε=0.3,0.35,0.4
• Compare our method gWT to
✓ Inverted index (concatenate all intervals and sort)
✓ Sequential scan (Compute similarity one by one)
22. Summary
• Efficient similarity search method for
massive graph databases
• Solve semi-conjunctive query efficiently
• Built on Wavelet Tree
• Use Weisfeiler-Lehman procedure to
represent graphs as bag-of-words
• Applicable to 25 million graphs
• Software: http://code.google.com/p/gwt