ALSIP, Dec. 1 2011


Kernel-based similarity search
 in massive graph databases
     with wavelet trees
         Yasuo Tabei and Koji Tsuda
       JST ERATO Minato Project,
 National Institute of Advanced Industrial
         Science and Technology
Outline
• Overview
• Wavelet Tree
  ✓ Problem = Range intersection on array
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Graph similarity search
• Similarity search for 25 million molecular
  graphs
 ✓ Find all graphs whose similarity to the query is at least 1 − ε
 ✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009)

• Use a data structure called the “Wavelet
  Tree” (SODA, 2003)
 ✓ Self-index of an integer array
 ✓ Enable fast array operations
 ‣ e.g., range minimum query, range intersection
Range intersection on array
•   Array A of length N, 1 ≤ A_i ≤ M

                    i   j       k ℓ
        A       1 3 6 8 2 5 7 1 2 7 4 5


• Range intersection: rint(A, [i,j],[k,ℓ])
    ✓ Find common elements of A[i,j] and A[k,ℓ]

• The naive method is to concatenate and sort
    Ex) concatenate:6,8,2,2,7 ⇛ sort:2,2,6,7,8

• Use a wavelet tree to solve the problem faster
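
The naive concatenate-and-sort method can be sketched as follows. This is a minimal illustration only; the function name `rint_naive` and the 1-based inclusive interval convention are assumptions chosen to match the slide's example, not code from the gWT software.

```python
def rint_naive(a, i, j, k, l):
    """Common values of a[i..j] and a[k..l] (1-based, inclusive)."""
    # Tag each value with the range it came from, then sort by value.
    tagged = [(v, 0) for v in a[i - 1:j]] + [(v, 1) for v in a[k - 1:l]]
    tagged.sort()
    common = []
    pos = 0
    while pos < len(tagged):
        value = tagged[pos][0]
        sources = set()
        # Walk over the run of equal values and note which ranges they came from.
        while pos < len(tagged) and tagged[pos][0] == value:
            sources.add(tagged[pos][1])
            pos += 1
        if len(sources) == 2:      # the value occurs in both ranges
            common.append(value)
    return common


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]
print(rint_naive(A, 3, 5, 9, 10))  # [2]
```

The cost is dominated by the sort over both ranges, which is what the wavelet tree is meant to avoid.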
Tree of subarrays:
Lower half=left, Higher half=right
            [1,8]
                1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                             [5,8]
            1 3 2 1 2 4                 6 8 5 7 7 5

  [1,2]                   [3,4] [5,6]                 [7,8]
     1 2 1 2          3 4          6 5 5        8 7 7


   1 1       2 2      3      4     5 5     6    7 7      8
Record whether each element is
 in the lower half (0) or the higher half (1)
            [1,8]
                  0 0 1 1 0 1 1 0 1 0 0 1

    [1,4]                                             [5,8]
            0 1 0 0 0 1                 0 1 0 1 1 0


  [1,2]                   [3,4] [5,6]               [7,8]
        0 1 0 1           0 1      1 0 0        1 0 0


    1         2         3   4       5    6      7     8
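
The construction on the two slides above can be sketched in a few lines. This is an illustrative sketch, not the gWT implementation: the nested-dict layout and the name `build_wavelet_tree` are assumptions, and real implementations store only the bit arrays (not Python lists of values).

```python
def build_wavelet_tree(values, lo, hi):
    """Return a nested dict describing the wavelet tree of `values` over [lo, hi]."""
    if lo == hi or not values:
        # Leaf (or empty subtree): nothing left to split.
        return {"range": (lo, hi), "bits": None, "size": len(values)}
    mid = (lo + hi) // 2
    # 0 = element belongs to the lower half [lo, mid], 1 = higher half [mid+1, hi].
    bits = [0 if v <= mid else 1 for v in values]
    left = [v for v in values if v <= mid]     # order of elements is preserved
    right = [v for v in values if v > mid]
    return {
        "range": (lo, hi),
        "bits": bits,
        "left": build_wavelet_tree(left, lo, mid),
        "right": build_wavelet_tree(right, mid + 1, hi),
    }


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
tree = build_wavelet_tree(A, 1, 8)
print(tree["bits"])           # [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(tree["left"]["bits"])   # [0, 1, 0, 0, 0, 1]
```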
Index each bit array
      with a rank dictionary
• Using a rank dictionary, the rank operation can be
  performed in O(1) time
  ✓ rankc(B, i) for c ∈ {0, 1}: return the number of occurrences of c in B[1,i]

• Several methods are known, e.g., rank9sel (Vigna, 08)
• Example) B=0110011100
                    i   1 2 3 4 5 6 7 8 9 10
                    B   0 1 1 0 0 1 1 1 0 0
 rank1(B, 8) = 5
 rank0(B, 5) = 3
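
A rank dictionary can be sketched with sampled prefix counts. The class below is a simplified stand-in and not rank9sel: the class name `RankDict` and the block size of 8 are assumptions for illustration. Because the block size is a constant, each query touches a bounded number of bits; practical structures replace the in-block scan with a popcount instruction.

```python
class RankDict:
    """Block-sampled rank over a bit list (simplified, not rank9sel)."""

    BLOCK = 8  # hypothetical block size

    def __init__(self, bits):
        self.bits = list(bits)
        # cum[b] = number of 1s in the first b full blocks.
        self.cum = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            self.cum.append(self.cum[-1] + sum(self.bits[start:start + self.BLOCK]))

    def rank1(self, i):
        """Number of 1s in B[1..i] (1-based, inclusive); rank1(0) = 0."""
        block, offset = divmod(i, self.BLOCK)
        begin = block * self.BLOCK
        return self.cum[block] + sum(self.bits[begin:begin + offset])

    def rank0(self, i):
        """Number of 0s in B[1..i]."""
        return i - self.rank1(i)


B = RankDict([0, 1, 1, 0, 0, 1, 1, 1, 0, 0])
print(B.rank1(8), B.rank0(5))  # 5 3
```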
O(1)-division of an interval
• Using the rank operation, the division of an
  interval can be done in constant time
    ✓ rank0 for left child and rank1 for right child

•   Naive = time linear in the total number of elements


          [1,8]
           Aroot 1 3 6 8 2 5 7 1 7 2 4 5

                   rank0                       rank1
      [1,4]                      [5,8]
       Aleft   1 3 2 1 2 4        Aright 6 8 5 7 7 5
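
The interval division can be written out explicitly. The sketch below assumes 1-based inclusive intervals and uses a plain prefix-count array in place of a real rank dictionary; the helper names `prefix_ones` and `split_interval` are hypothetical.

```python
from itertools import accumulate


def prefix_ones(bits):
    """ones[i] = rank1(B, i) for 0 <= i <= len(bits)."""
    return [0] + list(accumulate(bits))


def split_interval(ones, s, e):
    """Map the interval [s, e] of a node onto its left and right children.

    Returns (left, right); a child interval with start > end is empty.
    """
    rank1 = lambda i: ones[i]
    rank0 = lambda i: i - ones[i]
    left = (rank0(s - 1) + 1, rank0(e))    # positions among the 0-bits
    right = (rank1(s - 1) + 1, rank1(e))   # positions among the 1-bits
    return left, right


root_bits = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
ones = prefix_ones(root_bits)
print(split_interval(ones, 3, 5))   # ((3, 3), (1, 2))
```

For the root interval [3,5] (values 6 8 2), the left child receives position 3 of Aleft (value 2) and the right child receives positions 1 to 2 of Aright (values 6 8), using only four rank lookups.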
Fast computation of range
   intersection by pruning
Pruned      [1,8]
                  1 3 6 8 2 5 7 1 7 2 4 5

    [1,4]                                                [5,8]
            1 3 2 1 2 4                    6 8 5 7 7 5

  [1,2]                      [3,4] [5,6]             [7,8]
     1 2 1 2             3 4          6 5 5        8 7 7


    1 1       2 2        3      4     5 5     6    7 7     8

            solution!!
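
The pruned traversal can be sketched as a recursion that descends only while both intervals are non-empty. This is an illustrative reading of the slide, not the gWT code: the `Node` class, the prefix-count arrays standing in for rank dictionaries, and the function name `rint` with this signature are all assumptions.

```python
from itertools import accumulate


class Node:
    """Wavelet tree node over the value range [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo < hi and values:
            mid = (lo + hi) // 2
            bits = [0 if v <= mid else 1 for v in values]
            self.ones = [0] + list(accumulate(bits))   # ones[i] = rank1(B, i)
            self.left = Node([v for v in values if v <= mid], lo, mid)
            self.right = Node([v for v in values if v > mid], mid + 1, hi)


def rint(node, r1, r2, out):
    """Append to `out` every value present in both intervals (1-based, inclusive)."""
    (s1, e1), (s2, e2) = r1, r2
    if s1 > e1 or s2 > e2:
        return out                 # one interval is empty: prune this subtree
    if node.lo == node.hi:
        out.append(node.lo)        # leaf reached with both intervals non-empty
        return out
    one = node.ones
    zero = lambda i: i - one[i]
    rint(node.left, (zero(s1 - 1) + 1, zero(e1)), (zero(s2 - 1) + 1, zero(e2)), out)
    rint(node.right, (one[s1 - 1] + 1, one[e1]), (one[s2 - 1] + 1, one[e2]), out)
    return out


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]
root = Node(A, 1, 8)
print(rint(root, (3, 5), (9, 10), []))  # [2]
```

Subtrees such as [3,4] and [5,6] are never entered for this query because at least one of the two mapped intervals is empty there.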
Outline
• Overview
• Wavelet Tree
  ✓ Problem = Range intersection on array
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Graph Similarity Search
• Bag-of-words representation of graph
   ✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and
     Kashima (ICDM, 2009), Wang et al., (EDBT, 2009)


                      W=(A,D,E,H)


• Cosine similarity query
 ✓ Find all graphs W whose cosine similarity (kernel) to
   the query Q is at least 1 − ε
Weisfeiler-Lehman Procedure (NIPS,09)
•   Convert a graph into a set of words (bag-of-words)
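
A simplified version of the procedure can be sketched as iterated neighbourhood relabelling. This is an assumption-laden illustration: it keeps the expanded label strings as words instead of compressing them with a hash, and the function name `wl_bag_of_words`, the adjacency-dict input format, and the example graph are hypothetical.

```python
from collections import Counter


def wl_bag_of_words(adj, labels, iterations=1):
    """Bag of words of a node-labelled graph.

    adj:    dict mapping each node to a list of its neighbours
    labels: dict mapping each node to its initial label (a string)
    """
    current = dict(labels)
    bag = Counter(current.values())          # iteration 0: the raw node labels
    for _ in range(iterations):
        # New label = own label plus the sorted multiset of neighbour labels.
        current = {
            v: current[v] + "(" + ",".join(sorted(current[u] for u in adj[v])) + ")"
            for v in adj
        }
        bag.update(current.values())
    return bag


# Hypothetical 3-node path graph C - O - C.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "O", 2: "C"}
print(wl_bag_of_words(adj, labels))
```

Sorting the neighbour labels makes the words independent of node numbering, so isomorphic graphs produce the same bag.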
Semi-conjunctive query
• Cosine similarity query can be relaxed to
  the following form
           W s.t. |W ∩ Q| ≥ k
  ✓ Find all graphs W which share at least k words
    with the query Q

• No false negatives
• False positives can easily be filtered out by
  cosine calculations
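
The filter-then-verify idea can be sketched as below. The choice k = ⌈(1 − ε)² |Q|⌉ is an assumption derived here for sets: a cosine of at least 1 − ε forces |W| ≥ (1 − ε)² |Q|, and therefore |W ∩ Q| ≥ (1 − ε) √(|W| |Q|) ≥ (1 − ε)² |Q|. The function names and the toy database are hypothetical, and the candidate step is a plain scan rather than a wavelet tree traversal.

```python
import math


def cosine(w, q):
    """Cosine similarity of two word sets."""
    if not w or not q:
        return 0.0
    return len(w & q) / math.sqrt(len(w) * len(q))


def semi_conjunctive(database, q, k):
    """Ids of graphs sharing at least k words with q (may contain false positives)."""
    return [gid for gid, w in database.items() if len(w & q) >= k]


def search(database, q, eps):
    # Assumed threshold; the small tolerance guards against float round-up.
    k = math.ceil((1 - eps) ** 2 * len(q) - 1e-9)
    candidates = semi_conjunctive(database, q, k)
    # Exact cosine check removes the false positives.
    return [gid for gid in candidates if cosine(database[gid], q) >= 1 - eps]


database = {
    "g1": {"A", "D", "E", "H"},
    "g2": {"A", "D", "E", "X"},
    "g3": {"A", "Y", "Z"},
    "g4": {"A", "D", "1", "2", "3", "4", "5"},
}
Q = {"A", "D", "E", "H"}
print(semi_conjunctive(database, Q, 2))  # ['g1', 'g2', 'g4']
print(search(database, Q, 0.3))          # ['g1', 'g2']
```

Here `g4` passes the relaxed query but is dropped by the cosine check, which is the false-positive case the slide describes.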
Inverted index, Array, Wavelet Tree
• Inverted index is built from the graph database
• Concatenate all rows to make an array
• Index the array with a wavelet tree
• Semi-conjunctive query = extension of range intersection
    ✓ Find graph ids which appear at least k times in the given intervals

Aroot 1 3 6 8 2 5 7 1 2 7 4 5
    Wavelet Tree
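
The pipeline from inverted index to array can be sketched as follows. The counting step is a naive scan over the query's intervals, shown only to make the query semantics concrete; the wavelet tree would answer the same question by pruned traversal. The function names and the toy database are assumptions.

```python
from collections import Counter


def build_index(database):
    """Inverted index rows concatenated into one array, plus per-word intervals."""
    postings = {}
    for gid in sorted(database):
        for word in database[gid]:
            postings.setdefault(word, []).append(gid)
    array, intervals = [], {}
    for word in sorted(postings):
        start = len(array) + 1               # 1-based, inclusive
        array.extend(postings[word])         # one row of the inverted index
        intervals[word] = (start, len(array))
    return array, intervals


def at_least_k(array, intervals, query, k):
    """Graph ids appearing at least k times in the intervals of the query words."""
    counts = Counter()
    for word in query:
        if word in intervals:
            s, e = intervals[word]
            counts.update(array[s - 1:e])
    return sorted(g for g, c in counts.items() if c >= k)


database = {1: {"A", "D"}, 2: {"A", "E"}, 3: {"D", "E", "H"}}
array, intervals = build_index(database)
print(array)                                              # [1, 2, 1, 3, 2, 3, 3]
print(at_least_k(array, intervals, {"D", "E", "H"}, 2))   # [3]
```

Each graph id occurs at most once per row, so the count of an id across the query's intervals equals the number of words it shares with the query.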
Pruning search space
• Find all graphs W in the database whose cosine
    to a query Q is at least a threshold 1 − ε

       W s.t. K_N(W, Q) = |W ∩ Q| / √(|W| |Q|) ≥ 1 − ε

    ✓ W,Q: bag-of-words of graphs
• The above condition can be relaxed as follows

    If K_N(W, Q) ≥ 1 − ε, then

          (1 − ε)² |Q| ≤ |W| ≤ |Q| / (1 − ε)²

    ✓ Can be used for pruning search space
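
The size bound follows because |W ∩ Q| ≤ min(|W|, |Q|), so the cosine can never exceed min(|W|, |Q|) / √(|W| |Q|). A small sketch of the resulting pruning window is given below; the function name `size_bounds` and the rounding tolerance are assumptions, and 0 ≤ ε < 1 is required.

```python
import math


def size_bounds(q_size, eps):
    """Smallest and largest |W| that can still reach cosine >= 1 - eps."""
    lower = math.ceil((1 - eps) ** 2 * q_size - 1e-9)
    upper = math.floor(q_size / (1 - eps) ** 2 + 1e-9)
    return lower, upper


print(size_bounds(10, 0.3))  # (5, 20)
```

With |Q| = 10 and ε = 0.3, only graphs with between 5 and 20 words need to be examined; all others can be skipped without computing any intersection.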
Complexity
• Time per query: O(τm)
 •   τ: the number of traversed nodes
 •   m: the number of words in the query's bag-of-words

• Memory: (1+α)N log n + M log N bits
 •   N: the number of all words in the database
 •   M: maximum integer in the array
 •   n: the number of graphs
 •   α: overhead for rank dictionary (α=0.6)

• Inverted index takes N log n bits
• About 60% overhead compared to the inverted index!
Outline
• Overview
• A data-structure
  ✓ Wavelet Tree
• Graph similarity search
  ✓ Weisfeiler-Lehman kernel
  ✓ Apply wavelet tree

• Experiments
  ✓ Comparison to inverted index
  ✓ 25 million molecular graphs
Experiments

• 25 million chemical compounds from PubChem
  database
• Evaluate search time and memory usage
• Cosine threshold ε = 0.3, 0.35, 0.4
• Compare our method gWT to
 ✓   Inverted index (concatenate all intervals and sort)
 ✓   Sequential scan (Compute similarity one by one)
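Search time
  [Figure: search time; values shown: 40 sec, 38 sec, 8 sec, 3 sec, 2 sec]
Memory usage
  [Figure: memory usage; value shown: 20GB]
Construction time
  [Figure: construction time; value shown: 7h]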
Summary
• Efficient similarity search method for
    massive graph databases
• Solves semi-conjunctive queries efficiently
• Built on the wavelet tree
• Uses the Weisfeiler-Lehman procedure to
    represent graphs as bag-of-words
• Applicable to 25 million graphs
• Software: http://code.google.com/p/gwt

 

Gwt presen alsip-20111201

  • 1. ALSIP, Dec. 1 2011 Kernel-based similarity search in massive graph databases with wavelet trees Yasuo Tabei and Koji Tsuda JST ERATO Minato Project, National Institute of Advanced Industrial Science and Technology
  • 2. Outline • Overview • Wavelet Tree ✓ Problem = Range intersection on array • Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree • Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
  • 3. Graph similarity search • Similarity search for 25 million molecular graphs ✓ Find all graphs whose similarity to the query is at least 1 − ε ✓ Similarity = Weisfeiler-Lehman kernel (NIPS, 2009) • Use a data structure called “Wavelet Tree” (SODA, 2003) ✓ Self-index of an integer array ✓ Enables fast array operations ‣ e.g., range minimum query, range intersection
  • 4. Range intersection on array • Array A of length N, 1 ≤ A_i ≤ M, with index markers i, j, k, ℓ: A = 1 3 6 8 2 5 7 1 2 7 4 5 • Range intersection: rint(A, [i,j], [k,ℓ]) ✓ Find common elements of A[i,j] and A[k,ℓ] • The naive method is to concatenate and sort Ex) concatenate: 6,8,2,2,7 ⇛ sort: 2,2,6,7,8 • Use wavelet tree and solve the problem faster
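The naive concatenate-and-sort method from slide 4 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `rint_naive` and the concrete indices i=3, j=5, k=9, ℓ=10 are assumptions, chosen because they reproduce the concatenation 6,8,2,2,7 shown on the slide.

```python
def rint_naive(A, i, j, k, l):
    """Common elements of A[i..j] and A[k..l] (1-based, inclusive).

    Hypothetical helper: concatenate both ranges, tagging each value with
    the range it came from, sort, then scan neighbours for values that
    occur in both ranges.
    """
    tagged = [(v, 0) for v in A[i - 1:j]] + [(v, 1) for v in A[k - 1:l]]
    tagged.sort()
    common = []
    for (v1, s1), (v2, s2) in zip(tagged, tagged[1:]):
        # Equal values from different ranges are adjacent after sorting.
        if v1 == v2 and s1 != s2 and (not common or common[-1] != v1):
            common.append(v1)
    return common


A = [1, 3, 6, 8, 2, 5, 7, 1, 2, 7, 4, 5]   # array from slide 4
print(rint_naive(A, 3, 5, 9, 10))          # [2]
```

The cost is dominated by the sort over both ranges, which is what the wavelet tree avoids.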
  • 5. Tree of subarrays: Lower half=left, Higher half=right • [1,8]: 1 3 6 8 2 5 7 1 7 2 4 5 • [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5 • [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7 • Leaves: 1 1 | 2 2 | 3 | 4 | 5 5 | 6 | 7 7 | 8
  • 6. Remember whether each element is in the lower half (0) or the higher half (1) • [1,8]: 0 0 1 1 0 1 1 0 1 0 0 1 • [1,4]: 0 1 0 0 0 1; [5,8]: 0 1 0 1 1 0 • [1,2]: 0 1 0 1; [3,4]: 0 1; [5,6]: 1 0 0; [7,8]: 1 0 0 • Leaves: 1 2 3 4 5 6 7 8
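The bit arrays on slides 5 and 6 can be reproduced with a short recursive construction. This is a sketch under assumptions: the name `build_bits` and the dictionary keyed by value range are illustrative choices, and a real wavelet tree stores only the bits plus a rank index rather than rebuilding subarrays.

```python
def build_bits(values, lo, hi, out=None):
    """Record the bit array of every internal node [lo, hi].

    A value gets bit 0 if it falls in the lower half of the node's value
    range and bit 1 otherwise; the relative order of values is preserved
    when they are passed down to the children.
    """
    if out is None:
        out = {}
    if lo == hi or not values:
        return out
    mid = (lo + hi) // 2
    out[(lo, hi)] = [0 if v <= mid else 1 for v in values]
    build_bits([v for v in values if v <= mid], lo, mid, out)
    build_bits([v for v in values if v > mid], mid + 1, hi, out)
    return out


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]   # array from slide 5
bits = build_bits(A, 1, 8)
for node in sorted(bits, key=lambda r: (r[0] - r[1], r[0])):
    print(node, bits[node])
```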
  • 7. Index each bit array with a rank dictionary • Using a rank dictionary, the rank operation can be performed in O(1) time ✓ rank_c(B, i): return the number of occurrences of c ∈ {0, 1} in B[1,i] • Several methods are known, e.g., rank9sel (Vigna, 08) • Example) B = 0110011100, positions i = 1, ..., 10 ✓ rank_1(B, 8) = 5 ✓ rank_0(B, 5) = 3
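The rank operation can be illustrated with a plain prefix-count table. This is a stand-in, not rank9sel: it answers queries in O(1) but spends a full integer per bit, whereas succinct rank dictionaries reach the same query time with far less extra space. The class name `RankBits` is an assumption made for this sketch.

```python
class RankBits:
    """Rank dictionary via prefix sums (illustrative, not succinct)."""

    def __init__(self, bits):
        # ones[i] = number of 1 bits among the first i bits (1-based prefix).
        self.ones = [0]
        for b in bits:
            self.ones.append(self.ones[-1] + b)

    def rank1(self, i):
        """Number of 1s in B[1..i]."""
        return self.ones[i]

    def rank0(self, i):
        """Number of 0s in B[1..i]."""
        return i - self.ones[i]


B = RankBits([0, 1, 1, 0, 0, 1, 1, 1, 0, 0])   # B = 0110011100 from slide 7
print(B.rank1(8))   # 5
print(B.rank0(5))   # 3
```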
  • 8. O(1)-division of an interval • Using the rank operation, the division of an interval can be done in constant time ✓ rank_0 for left child and rank_1 for right child • Naive = time linear in the total number of elements • [1,8] A_root: 1 3 6 8 2 5 7 1 7 2 4 5 • rank_0 ⇒ [1,4] A_left: 1 3 2 1 2 4 • rank_1 ⇒ [5,8] A_right: 6 8 5 7 7 5
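To make the interval division concrete, the sketch below maps a position interval [s, e] of a node onto the corresponding intervals of its two children using only rank values. The function name `split_interval` is hypothetical, and the prefix counts are rebuilt on every call purely for brevity; with a precomputed rank dictionary each split needs four O(1) rank lookups.

```python
from itertools import accumulate


def split_interval(bits, s, e):
    """Map interval [s, e] (1-based, inclusive) of a node to its children.

    Returns (left, right) as (start, end) pairs; start > end means the
    child interval is empty.
    """
    ones = [0] + list(accumulate(bits))      # ones[i] = rank1(bits, i)

    def rank1(i):
        return ones[i]

    def rank0(i):
        return i - ones[i]

    left = (rank0(s - 1) + 1, rank0(e))
    right = (rank1(s - 1) + 1, rank1(e))
    return left, right


root_bits = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1]   # root bit array, slide 6
# Positions 3..5 of the root hold 6, 8, 2.
print(split_interval(root_bits, 3, 5))   # ((3, 3), (1, 2))
```

In the example, position 3 of A_left is the value 2 and positions 1..2 of A_right are 6 and 8, matching the elements of the original interval.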
  • 9. Fast computation of range intersection by pruning • [1,8]: 1 3 6 8 2 5 7 1 7 2 4 5 • [1,4]: 1 3 2 1 2 4; [5,8]: 6 8 5 7 7 5 • [1,2]: 1 2 1 2; [3,4]: 3 4; [5,6]: 6 5 5; [7,8]: 8 7 7 • Leaves: 1 1 | 2 2 | 3 | 4 | 5 5 | 6 | 7 7 | 8 • Figure labels: “Pruned” (skipped subtrees), “solution!!” (reached leaf)
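Putting the pieces together, range intersection descends the tree with both intervals at once and abandons a subtree as soon as either interval becomes empty. The following is a minimal sketch under assumptions: `Node` and `rint` are illustrative names, prefix-sum lists stand in for a succinct rank dictionary, and the query ranges in the usage lines are examples rather than the ones drawn on the slide.

```python
from itertools import accumulate


class Node:
    """Wavelet tree node over the value range [lo, hi]."""

    def __init__(self, values, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo < hi and values:
            mid = (lo + hi) // 2
            bits = [0 if v <= mid else 1 for v in values]
            self.ones = [0] + list(accumulate(bits))   # rank1 prefix counts
            self.left = Node([v for v in values if v <= mid], lo, mid)
            self.right = Node([v for v in values if v > mid], mid + 1, hi)


def rint(node, r1, r2, out=None):
    """Values occurring in both position ranges r1 and r2 (1-based)."""
    if out is None:
        out = []
    (s1, e1), (s2, e2) = r1, r2
    if s1 > e1 or s2 > e2:
        return out                      # one interval is empty: prune
    if node.lo == node.hi:
        out.append(node.lo)             # leaf reached with both non-empty
        return out
    o = node.ones
    # rank0 maps an interval into the left child, rank1 into the right.
    rint(node.left, (s1 - o[s1 - 1], e1 - o[e1]),
         (s2 - o[s2 - 1], e2 - o[e2]), out)
    rint(node.right, (o[s1 - 1] + 1, o[e1]), (o[s2 - 1] + 1, o[e2]), out)
    return out


A = [1, 3, 6, 8, 2, 5, 7, 1, 7, 2, 4, 5]   # array from slide 5
root = Node(A, 1, 8)
print(rint(root, (3, 5), (9, 10)))         # [2]
```

Results come out in increasing value order because leaves are visited left to right.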
  • 10. Outline • Overview • Wavelet Tree ✓ Problem = Range intersection on array • Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree • Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
  • 11. Graph Similarity Search • Bag-of-words representation of graph ✓ Weisfeiler-Lehman procedure (NIPS, 2009), Hido and Kashima (ICDM, 2009), Wang et al. (EDBT, 2009) ✓ Example: W=(A,D,E,H) • Cosine similarity query ✓ Find all graphs W whose cosine similarity (kernel) to the query Q is at least 1 − ε
  • 12. Weisfeiler-Lehman Procedure (NIPS,09) • Convert a graph into a set of words (bag-of-words)
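A simplified sketch of the relabeling idea is given below. It is not the authors' implementation: the function name `wl_words`, the adjacency-dict input, and the use of the expanded label string itself as the word are all assumptions. The published procedure compresses each expanded label into a fresh short label, which this sketch skips for readability.

```python
def wl_words(adj, labels, iterations=1):
    """Turn a labeled graph into a set of words by iterative relabeling.

    adj:    dict mapping each node to a list of neighbouring nodes
    labels: dict mapping each node to its initial label (a string)
    """
    current = dict(labels)
    words = set(current.values())
    for _ in range(iterations):
        new = {}
        for v in adj:
            # New label = own label plus the sorted labels of neighbours.
            neigh = sorted(current[u] for u in adj[v])
            new[v] = current[v] + "(" + ",".join(neigh) + ")"
        current = new
        words.update(current.values())
    return words


# Hypothetical three-atom path C - C - O.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
print(sorted(wl_words(adj, labels)))
```

Two graphs that share many local neighbourhoods end up sharing many words, which is what the cosine query later exploits.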
  • 13. Semi-conjunctive query • Cosine similarity query can be relaxed to the following form: W s.t. |W ∩ Q| ≥ k ✓ Find all graphs W which share at least k words with the query Q • No false negatives • False positives can easily be filtered out by cosine calculations
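The two-stage idea (cheap overlap test first, exact cosine second) can be sketched as below. The names `cosine`, `semi_conjunctive` and `search` are hypothetical, the database is a plain dict of non-empty word sets, and k is passed in by the caller since the slide does not fix its value.

```python
import math


def cosine(W, Q):
    """Cosine between two non-empty word sets."""
    return len(W & Q) / math.sqrt(len(W) * len(Q))


def semi_conjunctive(db, Q, k):
    """Ids of graphs sharing at least k words with the query."""
    return [gid for gid, W in db.items() if len(W & Q) >= k]


def search(db, Q, eps, k):
    """Relaxed query first, then filter false positives by exact cosine."""
    candidates = semi_conjunctive(db, Q, k)
    return [gid for gid in candidates if cosine(db[gid], Q) >= 1 - eps]


# Hypothetical toy database: graph id -> set of words.
db = {1: {"a", "b", "c"}, 2: {"a"}, 3: {"x", "y"}}
Q = {"a", "b", "c", "d"}
print(semi_conjunctive(db, Q, k=1))   # [1, 2]
print(search(db, Q, eps=0.3, k=1))    # [1]
```

Graph 2 passes the overlap test but has cosine 0.5, so the filter removes it as a false positive.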
  • 14. Inverted index, Array, Wavelet Tree • Inverted index is built from graph database • Concatenate all rows to make an array • Index the array with wavelet tree (A_root: 1 3 6 8 2 5 7 1 2 7 4 5) • Semi-conjunctive query = Extension of range intersection ✓ Find graph ids which appear at least k times in given intervals
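The data layout on slide 14 can be sketched as follows: one row of graph ids per word, all rows concatenated into a single array, and one interval per word. The names `build_array` and `semi_conjunctive_naive` are assumptions, and the query below counts ids by scanning the intervals directly; it shows what the wavelet tree must compute, not how the tree computes it.

```python
from collections import Counter


def build_array(db):
    """db: dict graph_id -> set of words.

    Returns the concatenated inverted-index rows and, for each word,
    the 1-based inclusive interval its row occupies in that array.
    """
    inverted = {}
    for gid in sorted(db):
        for w in db[gid]:
            inverted.setdefault(w, []).append(gid)
    array, intervals = [], {}
    for w in sorted(inverted):
        start = len(array) + 1
        array.extend(inverted[w])
        intervals[w] = (start, len(array))
    return array, intervals


def semi_conjunctive_naive(array, intervals, Q, k):
    """Graph ids appearing at least k times in the intervals of Q's words."""
    counts = Counter()
    for w in Q:
        if w in intervals:
            s, e = intervals[w]
            counts.update(array[s - 1:e])
    return sorted(g for g, c in counts.items() if c >= k)


# Hypothetical toy database.
db = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "b", "c"}}
array, intervals = build_array(db)
print(array)                                                   # [1, 3, 1, 2, 3, 2, 3]
print(semi_conjunctive_naive(array, intervals, {"a", "c"}, 2))  # [3]
```

Because each graph id occurs at most once per row, the count for a graph equals the number of query words it contains.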
  • 15. Pruning search space • Find all graphs W in the database whose cosine to a query Q is at least a threshold 1 − ε: W s.t. $K_N(W, Q) = \frac{|W \cap Q|}{\sqrt{|W||Q|}} \ge 1 - \epsilon$ ✓ W, Q: bag-of-words of graphs • The above solution can be relaxed as follows • If $K_N(W, Q) \ge 1 - \epsilon$, then $(1-\epsilon)^2 |Q| \le |W| \le \frac{|Q|}{(1-\epsilon)^2}$ ✓ Can be used for pruning search space
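The size bound follows from the fact that the intersection is no larger than either set. A short derivation, assuming K_N is the cosine between the word sets W and Q and that 0 ≤ ε < 1:

```latex
\begin{align*}
1-\epsilon \le K_N(W,Q) = \frac{|W \cap Q|}{\sqrt{|W|\,|Q|}}
  &\le \frac{|Q|}{\sqrt{|W|\,|Q|}} = \sqrt{\frac{|Q|}{|W|}}
  &&\Rightarrow\; |W| \le \frac{|Q|}{(1-\epsilon)^2},\\
1-\epsilon \le K_N(W,Q)
  &\le \frac{|W|}{\sqrt{|W|\,|Q|}} = \sqrt{\frac{|W|}{|Q|}}
  &&\Rightarrow\; |W| \ge (1-\epsilon)^2\,|Q|.
\end{align*}
```

Combining the lower bound on |W| with the cosine condition also gives |W ∩ Q| ≥ (1 − ε)·sqrt(|W||Q|) ≥ (1 − ε)²|Q|, so k = (1 − ε)²|Q| is one threshold for the semi-conjunctive query that produces no false negatives. The slides do not state which k is used, so treat this as one valid choice.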
  • 16. Complexity • Time per query: O(τm) • τ: the number of traversed nodes • m: the number of bag-of-words in a query • Memory: (1+α)N log n + M log N bits • N: the number of all words in the database • M: maximum integer in the array • n: the number of graphs • α: overhead for rank dictionary (α=0.6) • Inverted index takes N log n bits • About 60% overhead compared to the inverted index!
  • 17. Outline • Overview • A data-structure ✓ Wavelet Tree • Graph similarity search ✓ Weisfeiler-Lehman kernel ✓ Apply wavelet tree • Experiments ✓ Comparison to inverted index ✓ 25 million molecular graphs
  • 18. Experiments • 25 million chemical compounds from PubChem database • Evaluate search time and memory usage • Cosine threshold ε=0.3,0.35,0.4 • Compare our method gWT to ✓ Inverted index (concatenate all intervals and sort) ✓ Sequential scan (Compute similarity one by one)
  • 19. Search time (chart; values shown: 40 sec, 38 sec, 8 sec, 3 sec, 2 sec)
  • 20. Memory usage (chart; value shown: 20GB)
  • 22. Summary • Efficient similarity search method for massive graph databases • Solves semi-conjunctive queries efficiently • Built on Wavelet Tree • Uses Weisfeiler-Lehman procedure to represent graphs as bag-of-words • Applicable to 25 million graphs • Software • http://code.google.com/p/gwt