SlideShare une entreprise Scribd logo
1  sur  22
2011 SIAM International Conference on Data Mining,
Apr. 28, 2011


       Kernel-based similarity search
       in massive graph databases
       with wavelet trees

         Yasuo Tabei and Koji Tsuda

         JST ERATO Minato Project, Japan
         AIST Comp Bio Research Center, Japan
Graph similarity search	
n    Similarity search for 25 million molecular graphs
       •  Find all graphs whose similarity to the query ≥ 1 − 
       •  Similarity = Weisfeiler-Lehman graph kernel (NIPS,2009)
n    Use data structure called “Wavelet Tree”
      (SODA,2003)
       •  Self-index of an integer array
          •  Combination of rank dictionaries of bit arrays
       •  Enables fast array operations
          •  e.g., range minimum query, range intersection
n    First review wavelet tree in the range intersection
      problem, then graph similarity search
                                                              11/05/12   2
Range intersection on array	
n    Array A of length N,    1 ≤ Ai ≤ M
                    i	
  j	
    k   
         A       1 3 6 8 2 5 7 1 2 7 4 5

n    Range intersection: rint(A,[i,j],[k,])
      •  Find set intersection of A[i,j] and A[k,]
      •  The naïve solution is to concatenate and sort
n    Use an index structure of the array (Wavelet
      Tree) and solve the problem faster!	


                                                 11/05/12   3
Tree of subarrays:
Lower half = left, Higher half=right	
          [1,8]
              1 3 6 8 2 5 7 1 7 2 4 5

  [1,4]                                             [5,8]
          1 3 2 1 2 4                 6 8 5 7 7 5

[1,2]                   [3,4] [5,6]                 [7,8]
   1 2 1 2          3 4          6 5 5        8 7 7


 1 1       2 2      3      4     5 5     6    7 7      8
Remember if each element is either in
lower half (0) or higher half (1) 	
           [1,8]
               0 0 1 1 0 1 1 0 1 0 0 1

   [1,4]                                                 [5,8]
           0 1 0 0 0 1                 0 1 0 1 1 0


 [1,2]                   [3,4] [5,6]                    [7,8]
     0 1 0 1             0 1      1 0 0          1 0 0


   1	
       2	
     3	
 4	
       5	
 6	
       7	
 8	
                                             11/05/12            5
Index each bit array with a rank
 dictionary	
n    With rank dictionary, the rank operation can be
      done in O(1) time
      •  rankc(B,i): return the number of c   {0,1} in B[1…i]
n    Several algorithms known: rank9sel	
  Example) B=0110011100	
                                i 1 2 3 4 5 6 7 8 9 10
         rank1(B,8)=5             011001110 0
         rank0(B,5)=3	
           0 1 1 0 0 1 1 1 0 0	

                                                     11/05/12   6
Implementation of rank dictionary	

                        n  Divide the bit array B into
B	
                     large blocks of length l=log2n
                          RL=Ranks of large blocks
RL	
                    n  Divide each large block to
                        small blocks of length s=logn/2
                          Rs=Ranks of small blocks
RS	
                         relative to the large block
            rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank)	
                              Time:O(1)
                              Memory: n +o(n) bits
O(1)-time division of an interval	
n    Using the rank operations, the division of an
      interval can be done in constant time
      •  rank0 for left child and rank1 for right child
n    Naïve = linear time to the total number of elements

           [1,8]
            Aroot 1 3 6 8 2 5 7 1 7 2 4 5

                      rank0	
                                rank1	
       [1,4]                            [5,8]
       Aleft    1 3 2 1 2 4              Aright 6 8 5 7 7 5
                                                          11/05/12     8
Fast computation of rank
             intersection by pruning	
Pruned      [1,8]
                  1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                                [5,8]
            1 3 2 1 2 4                    6 8 5 7 7 5

  [1,2]                      [3,4] [5,6]             [6,8]
     1 2 1 2             3 4          6 5 5        8 7 7


    1 1       2 2        3      4     5 5     6    7 7     8

            solution!!
Graph Similarity Search	

n    Bag-of-words representation of graph
      •  Weisfeiler-Lehman procedure (NIPS, 2009), Hido and
         Kashima(2009),Wang et al.,(2009)

                         W=(A,D,E,H)	

n    Cosine similarity query
      •  Find all graphs W whose cosine similarity
         (kernel) to the query Q is at least 1-ε


                                              11/05/12    10
Weisfeiler-Lehman Procedure (NIPS,09)
                                    	
 n Convert   a graph into a set of words (bag-of-words)	

                  i) Make a label set of adjacent
   A	
                     vertices ex) {E,A,D}
        B	
  E	
  ii) Sort    ex) A,D,E
                  iii) Add the vertex label as a prefix
   D	
                              ex) B,A,D,E
    A	
                  iv) Map the label sequence to a
        R	
  E	
      unique value
                               ex) B,A,D,E R
    D	
                  v) Assign the value as the new
  Bag-of-words
  {A,B,D,E,…,R,…}	
 vertex label
Semi-conjunctive query	

n    Cosine similarity query can be relaxed to the
      following form	
                W s.t |W ∩ Q| ≥ k
n  Find all graphs W which share at least k words to the
    query Q
n  No false negatives
n  False positives can easily be filtered out by cosine
    calculations


                                            11/05/12   12
Inverted index, Array, Wavelet Tree	
         	
                   	
    n   Graph database is represented
    	
                   	
              as inverted index
    	
              	
                                     n  Concatenate all rows to make
    	
              	
    	
         	
                                         an array
                                     n  Index the array with wavelet
                                         tree
Aroot 1 3 6 8 2 5 7 1 2 7 4        5
                                     n  Semi-conjunctive query =
                                         Extension of range intersection
     Wavelet Tree	
                       •  Find graph ids which appear at
                                             least k times in the array

                                                            11/05/12     13
Pruning condition	

n    Find all graphs W in the database whose cosine
      is larger than a threshold 1-ε
                                          | W !Q |
                   W s.t K N (W,Q) =                ! 1 !
                                          |W | |Q |
      l    W, Q: bag-of-words of graphs
n The above solution can be relaxed as follows,
  If K N (Q,W ) ! 1 ! , then
                                                |Q |
                      (1! ! )2 | Q |!| W |!
                                              (1! ! )2

      l    Can be used for pruning
Complexity	

n   Time per query: O(τm)
      •  τ: the number of traversed nodes
      •  m: the number of bag-of-words in a query
n  Memory: (1+α)N log n + M log N
      •  N: the number of all words in the database
      •  M: Maximum integer in the array
      •  n: the number of graphs
      •  α: overhead for rank dictionary (α=0.6)
 n  Inverted index takes Nlogn bits
 n  About 60% overhead to inverted index!
Experiments	

n  25 million chemical compounds from PubChem
    database
n  Evaluated search time and memory usage
n  Cosine thresholds ε=0.3,0.35,0.4
n  Compare our method gWT to
      •  Inverted index (concatenate all intervals and sort)
      •  Sequential scan (compute kernel one by one)
Search time	
                40 sec	
                38 sec	




                8 sec	
                3 sec	
                2 sec
Memory usage	
                 20GB
Construction time	

                      7h
Related work	
  A lot of methods have been proposed so far.
n 
 1.Graph Grep [Shasha et al., 02]
 2.gIndex [Yan et al., 04]
 3.Tree+Delta [Zhao et al., 07]
 4.TreePi [Zhang et al., 07]
 5.Gstring [Jiang et al., 07]
 6.FG-Index [Cheng et al., 07]
 7.GDIndex [Williams et al., 07]
 etc
Related work 	
                      n  Decompose graphs into
                          a set of substructures
                           •  graphs, trees, paths
        Decompose	
                              etc

                   …	
n  Build a substructure
                          based index
                      n  Drawbacks
         Index	
                           •  Need substructure
                              mining
                           •  Do not scale to
                              millions of graphs
Summary	
n  Efficient similarity search method of massive
    graph databases
n  Solve semi-conjunctive queries efficiently
n  Our method is built on wavelet trees
n  Use Weisfeiler-Lehman procedure to convert
    graphs into bag-of-words
n  Applicable to 25 million graphs
n  Software
 http://code.google.com/p/gwt/

Contenu connexe

Tendances

Neural Processes
Neural ProcessesNeural Processes
Neural ProcessesSangwoo Mo
 
Group {1, −1, i, −i} Cordial Labeling of Product Related Graphs
Group {1, −1, i, −i} Cordial Labeling of Product Related GraphsGroup {1, −1, i, −i} Cordial Labeling of Product Related Graphs
Group {1, −1, i, −i} Cordial Labeling of Product Related GraphsIJASRD Journal
 
6 adesh kumar tripathi -71-74
6 adesh kumar tripathi -71-746 adesh kumar tripathi -71-74
6 adesh kumar tripathi -71-74Alexander Decker
 
l1-Embeddings and Algorithmic Applications
l1-Embeddings and Algorithmic Applicationsl1-Embeddings and Algorithmic Applications
l1-Embeddings and Algorithmic ApplicationsGrigory Yaroslavtsev
 
NIPS2010: optimization algorithms in machine learning
NIPS2010: optimization algorithms in machine learningNIPS2010: optimization algorithms in machine learning
NIPS2010: optimization algorithms in machine learningzukun
 
Multilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureMultilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureVjekoslavKovac1
 
Tales on two commuting transformations or flows
Tales on two commuting transformations or flowsTales on two commuting transformations or flows
Tales on two commuting transformations or flowsVjekoslavKovac1
 
A Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeA Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeVjekoslavKovac1
 
11.the univalence of some integral operators
11.the univalence of some integral operators11.the univalence of some integral operators
11.the univalence of some integral operatorsAlexander Decker
 
The univalence of some integral operators
The univalence of some integral operatorsThe univalence of some integral operators
The univalence of some integral operatorsAlexander Decker
 
Paraproducts with general dilations
Paraproducts with general dilationsParaproducts with general dilations
Paraproducts with general dilationsVjekoslavKovac1
 
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...MLconf
 
Tensorizing Neural Network
Tensorizing Neural NetworkTensorizing Neural Network
Tensorizing Neural NetworkRuochun Tzeng
 
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...Yandex
 
Scattering theory analogues of several classical estimates in Fourier analysis
Scattering theory analogues of several classical estimates in Fourier analysisScattering theory analogues of several classical estimates in Fourier analysis
Scattering theory analogues of several classical estimates in Fourier analysisVjekoslavKovac1
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networksDavid Gleich
 

Tendances (20)

Neural Processes
Neural ProcessesNeural Processes
Neural Processes
 
Group {1, −1, i, −i} Cordial Labeling of Product Related Graphs
Group {1, −1, i, −i} Cordial Labeling of Product Related GraphsGroup {1, −1, i, −i} Cordial Labeling of Product Related Graphs
Group {1, −1, i, −i} Cordial Labeling of Product Related Graphs
 
6 adesh kumar tripathi -71-74
6 adesh kumar tripathi -71-746 adesh kumar tripathi -71-74
6 adesh kumar tripathi -71-74
 
Mgm
MgmMgm
Mgm
 
1 cb02e45d01
1 cb02e45d011 cb02e45d01
1 cb02e45d01
 
l1-Embeddings and Algorithmic Applications
l1-Embeddings and Algorithmic Applicationsl1-Embeddings and Algorithmic Applications
l1-Embeddings and Algorithmic Applications
 
NIPS2010: optimization algorithms in machine learning
NIPS2010: optimization algorithms in machine learningNIPS2010: optimization algorithms in machine learning
NIPS2010: optimization algorithms in machine learning
 
Multilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureMultilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structure
 
Tales on two commuting transformations or flows
Tales on two commuting transformations or flowsTales on two commuting transformations or flows
Tales on two commuting transformations or flows
 
A Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeA Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cube
 
11.the univalence of some integral operators
11.the univalence of some integral operators11.the univalence of some integral operators
11.the univalence of some integral operators
 
The univalence of some integral operators
The univalence of some integral operatorsThe univalence of some integral operators
The univalence of some integral operators
 
Paraproducts with general dilations
Paraproducts with general dilationsParaproducts with general dilations
Paraproducts with general dilations
 
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
 
Tensorizing Neural Network
Tensorizing Neural NetworkTensorizing Neural Network
Tensorizing Neural Network
 
On the Zeros of Complex Polynomials
On the Zeros of Complex PolynomialsOn the Zeros of Complex Polynomials
On the Zeros of Complex Polynomials
 
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
Nelly Litvak – Asymptotic behaviour of ranking algorithms in directed random ...
 
Scattering theory analogues of several classical estimates in Fourier analysis
Scattering theory analogues of several classical estimates in Fourier analysisScattering theory analogues of several classical estimates in Fourier analysis
Scattering theory analogues of several classical estimates in Fourier analysis
 
Lec28
Lec28Lec28
Lec28
 
Higher-order organization of complex networks
Higher-order organization of complex networksHigher-order organization of complex networks
Higher-order organization of complex networks
 

En vedette

「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2
「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2
「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2Yahoo!デベロッパーネットワーク
 
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)Hadoop / Spark Conference Japan
 
Hadoopカンファレンス20140707
Hadoopカンファレンス20140707Hadoopカンファレンス20140707
Hadoopカンファレンス20140707Recruit Technologies
 
Hivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupHivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupMakoto Yui
 
FluentdやNorikraを使った データ集約基盤への取り組み紹介
FluentdやNorikraを使った データ集約基盤への取り組み紹介FluentdやNorikraを使った データ集約基盤への取り組み紹介
FluentdやNorikraを使った データ集約基盤への取り組み紹介Recruit Technologies
 
実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...
実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...
実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...MapR Technologies Japan
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Ryu Kobayashi
 
Shib: WebUI tool provides crossover of Hive and MPP
Shib: WebUI tool provides crossover of Hive and MPPShib: WebUI tool provides crossover of Hive and MPP
Shib: WebUI tool provides crossover of Hive and MPPSATOSHI TAGOMORI
 

En vedette (10)

「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2
「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2
「PV、UBなどの数値からでは見えてこないユーザー行動の可視化」#yjdsw2
 
Hcj2014 myui
Hcj2014 myuiHcj2014 myui
Hcj2014 myui
 
「最近傍検索とその応用」#yjdsw2
「最近傍検索とその応用」#yjdsw2「最近傍検索とその応用」#yjdsw2
「最近傍検索とその応用」#yjdsw2
 
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
Mahoutによるアルツハイマー診断支援へ向けた取り組み (Hadoop Confernce Japan 2014)
 
Hadoopカンファレンス20140707
Hadoopカンファレンス20140707Hadoopカンファレンス20140707
Hadoopカンファレンス20140707
 
Hivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupHivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetup
 
FluentdやNorikraを使った データ集約基盤への取り組み紹介
FluentdやNorikraを使った データ集約基盤への取り組み紹介FluentdやNorikraを使った データ集約基盤への取り組み紹介
FluentdやNorikraを使った データ集約基盤への取り組み紹介
 
実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...
実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...
実践機械学習 — MahoutとSolrを活用したレコメンデーションにおけるイノベーション - 2014/07/08 Hadoop Conference ...
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014
 
Shib: WebUI tool provides crossover of Hive and MPP
Shib: WebUI tool provides crossover of Hive and MPPShib: WebUI tool provides crossover of Hive and MPP
Shib: WebUI tool provides crossover of Hive and MPP
 

Similaire à Gwt sdm public

Similaire à Gwt sdm public (20)

SISAP17
SISAP17SISAP17
SISAP17
 
Lca seminar modified
Lca seminar modifiedLca seminar modified
Lca seminar modified
 
Salem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docx
Salem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docxSalem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docx
Salem Almarar Heckman MAT 242 Spring 2017Assignment Chapter .docx
 
PAM.ppt
PAM.pptPAM.ppt
PAM.ppt
 
Linear sorting
Linear sortingLinear sorting
Linear sorting
 
Review session2
Review session2Review session2
Review session2
 
Algorithm Assignment Help
Algorithm Assignment HelpAlgorithm Assignment Help
Algorithm Assignment Help
 
Algorithm Exam Help
Algorithm Exam HelpAlgorithm Exam Help
Algorithm Exam Help
 
Lec35
Lec35Lec35
Lec35
 
Week09
Week09Week09
Week09
 
sorting
sortingsorting
sorting
 
Ch07 linearspacealignment
Ch07 linearspacealignmentCh07 linearspacealignment
Ch07 linearspacealignment
 
Hussam Malibari Heckman MAT 242 Spring 2017Assignment Chapte.docx
Hussam Malibari Heckman MAT 242 Spring 2017Assignment Chapte.docxHussam Malibari Heckman MAT 242 Spring 2017Assignment Chapte.docx
Hussam Malibari Heckman MAT 242 Spring 2017Assignment Chapte.docx
 
Algo-Exercises-2-hash-AVL-Tree.ppt
Algo-Exercises-2-hash-AVL-Tree.pptAlgo-Exercises-2-hash-AVL-Tree.ppt
Algo-Exercises-2-hash-AVL-Tree.ppt
 
Data structures
Data structuresData structures
Data structures
 
Plc (1)
Plc (1)Plc (1)
Plc (1)
 
lecture 17
lecture 17lecture 17
lecture 17
 
Assignment4
Assignment4Assignment4
Assignment4
 
Algorithm Homework Help
Algorithm Homework HelpAlgorithm Homework Help
Algorithm Homework Help
 
Dinive conquer algorithm
Dinive conquer algorithmDinive conquer algorithm
Dinive conquer algorithm
 

Plus de Yasuo Tabei

Space-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment KernelsSpace-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment KernelsYasuo Tabei
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesYasuo Tabei
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabeiYasuo Tabei
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceYasuo Tabei
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesYasuo Tabei
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306Yasuo Tabei
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009Yasuo Tabei
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20Yasuo Tabei
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 publicYasuo Tabei
 
Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicYasuo Tabei
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicYasuo Tabei
 

Plus de Yasuo Tabei (14)

Space-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment KernelsSpace-efficient Feature Maps for String Alignment Kernels
Space-efficient Feature Maps for String Alignment Kernels
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant Space
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
 
GIW2013
GIW2013GIW2013
GIW2013
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
 
Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
 
Lp Boost
Lp BoostLp Boost
Lp Boost
 

Dernier

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Dernier (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Gwt sdm public

  • 1. 2011 SIAM International Conference on Data Mining, Apr. 28, 2011 Kernel-based similarity search in massive graph databases with wavelet trees Yasuo Tabei and Koji Tsuda JST ERATO Minato Project, Japan AIST Comp Bio Research Center, Japan
  • 2. Graph similarity search n  Similarity search for 25 million molecular graphs •  Find all graphs whose similarity to the query ≥ 1 − •  Similarity = Weisfeiler-Lehman graph kernel (NIPS,2009) n  Use data structure called “Wavelet Tree” (SODA,2003) •  Self-index of an integer array •  Combination of rank dictionaries of bit arrays •  Enables fast array operations •  e.g., range minimum query, range intersection n  First review wavelet tree in the range intersection problem, then graph similarity search 11/05/12 2
  • 3. Range intersection on array n  Array A of length N, 1 ≤ Ai ≤ M i j k  A 1 3 6 8 2 5 7 1 2 7 4 5 n  Range intersection: rint(A,[i,j],[k,]) •  Find set intersection of A[i,j] and A[k,] •  The naïve solution is to concatenate and sort n  Use an index structure of the array (Wavelet Tree) and solve the problem faster! 11/05/12 3
  • 4. Tree of subarrays: Lower half = left, Higher half=right [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [7,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8
  • 5. Remember if each element is either in lower half (0) or higher half (1) [1,8] 0 0 1 1 0 1 1 0 1 0 0 1 [1,4] [5,8] 0 1 0 0 0 1 0 1 0 1 1 0 [1,2] [3,4] [5,6] [7,8] 0 1 0 1 0 1 1 0 0 1 0 0 1 2 3 4 5 6 7 8 11/05/12 5
  • 6. Index each bit array with a rank dictionary n  With rank dictionary, the rank operation can be done in O(1) time •  rankc(B,i): return the number of c {0,1} in B[1…i] n  Several algorithms known: rank9sel Example) B=0110011100 i 1 2 3 4 5 6 7 8 9 10 rank1(B,8)=5 011001110 0 rank0(B,5)=3 0 1 1 0 0 1 1 1 0 0 11/05/12 6
  • 7. Implementation of rank dictionary n  Divide the bit array B into B large blocks of length l=log2n RL=Ranks of large blocks RL n  Divide each large block to small blocks of length s=logn/2 Rs=Ranks of small blocks RS relative to the large block rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank) Time:O(1) Memory: n +o(n) bits
  • 8. O(1)-time division of an interval n  Using the rank operations, the division of an interval can be done in constant time •  rank0 for left child and rank1 for right child n  Naïve = linear time to the total number of elements [1,8] Aroot 1 3 6 8 2 5 7 1 7 2 4 5 rank0 rank1 [1,4] [5,8] Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5 11/05/12 8
  • 9. Fast computation of rank intersection by pruning Pruned [1,8] 1 3 6 8 2 5 7 1 7 2 4 5 [1,4] [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [6,8] 1 2 1 2 3 4 6 5 5 8 7 7 1 1 2 2 3 4 5 5 6 7 7 8 solution!!
  • 10. Graph Similarity Search n  Bag-of-words representation of graph •  Weisfeiler-Lehman procedure (NIPS, 2009), Hido and Kashima(2009),Wang et al.,(2009) W=(A,D,E,H) n  Cosine similarity query •  Find all graphs W whose cosine similarity (kernel) to the query Q is at least 1-ε 11/05/12 10
  • 11. Weisfeiler-Lehman Procedure (NIPS,09) n Convert a graph into a set of words (bag-of-words) i) Make a label set of adjacent A vertices ex) {E,A,D} B E ii) Sort ex) A,D,E iii) Add the vertex label as a prefix D ex) B,A,D,E A iv) Map the label sequence to a R E unique value ex) B,A,D,E R D v) Assign the value as the new Bag-of-words {A,B,D,E,…,R,…} vertex label
  • 12. Semi-conjunctive query n  Cosine similarity query can be relaxed to the following form W s.t |W ∩ Q| ≥ k n  Find all graphs W which share at least k words to the query Q n  No false negatives n  False positives can easily be filtered out by cosine calculations 11/05/12 12
  • 13. Inverted index, Array, Wavelet Tree n  Graph database is represented as inverted index n  Concatenate all rows to make an array n  Index the array with wavelet tree Aroot 1 3 6 8 2 5 7 1 2 7 4 5 n  Semi-conjunctive query = Extension of range intersection Wavelet Tree •  Find graph ids which appear at least k times in the array 11/05/12 13
  • 14. Pruning condition n  Find all graphs W in the database whose cosine is larger than a threshold 1-ε | W !Q | W s.t K N (W,Q) = ! 1 ! |W | |Q | l  W, Q: bag-of-words of graphs n The above solution can be relaxed as follows, If K N (Q,W ) ! 1 ! , then |Q | (1! ! )2 | Q |!| W |! (1! ! )2 l  Can be used for pruning
  • 15. Complexity n  Time per query: O(τm) •  τ: the number of traversed nodes •  m: the number of bag-of-words in a query n  Memory: (1+α)N log n + M log N •  N: the number of all words in the database •  M: Maximum integer in the array •  n: the number of graphs •  α: overhead for rank dictionary (α=0.6) n  Inverted index takes Nlogn bits n  About 60% overhead to inverted index!
  • 16. Experiments n  25 million chemical compounds from PubChem database n  Evaluated search time and memory usage n  Cosine thresholds ε=0.3,0.35,0.4 n  Compare our method gWT to •  Inverted index (concatenate all intervals and sort) •  Sequential scan (compute kernel one by one)
  • 17. Search time 40 sec 38 sec 8 sec 3 sec 2 sec
  • 18. Memory usage 20GB
  • 20. Related work A lot of methods have been proposed so far. n  1.Graph Grep [Shasha et al., 02] 2.gIndex [Yan et al., 04] 3.Tree+Delta [Zhao et al., 07] 4.TreePi [Zhang et al., 07] 5.Gstring [Jiang et al., 07] 6.FG-Index [Cheng et al., 07] 7.GDIndex [Williams et al., 07] etc
  • 21. Related work n  Decompose graphs into a set of substructures •  graphs, trees, paths Decompose etc … n  Build a substructure based index n  Drawbacks Index •  Need substructure mining •  Do not scale to millions of graphs
  • 22. Summary n  Efficient similarity search method of massive graph databases n  Solve semi-conjunctive queries efficiently n  Our method is built on wavelet trees n  Use Weisfeiler-Lehman procedure to convert graphs into bag-of-words n  Applicable to 25 million graphs n  Software http://code.google.com/p/gwt/