A Quantitative analysis and performance study for similarity
                 search methods in high-dimensional spaces




                                          Group 4
                                             Seokhwan Eom,
                                               Jungyeol Lee,
                                                   Rina You,
                                                  Kilho Lee,
Presenter: Seokhwan Eom




Contents

•   Introduction
•   Observations
•   Analysis of NN-search
•   VA-file
•   Conclusion








The Similarity Search Paradigm




( Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )




The Similarity Search Paradigm




  Locate the point closest to the query object, i.e., its nearest neighbor (NN)



( Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )




The conventional approach

• Space-partitioning methods
   - Gridfile   [Nievergelt:1984]
   - K-D-B tree [Robinson:1981]
   - Quad tree [Finkel:1974]

• Data-partitioning index trees
   -R-tree         [Guttman:1984]   -R+-tree    [Sellis:1987]
   -R*-tree       [Beckmann:1990]   -X-tree    [Berchtold:1996]
   -SR-tree       [Katayama:1997]   -M-tree    [Ciaccia:1996]
   -TV-tree       [Lin:1994]        -hB-tree    [Lomet:1990]
Unfortunately, as the number of dimensions increases, their performance degrades.
- The "dimensional curse"





Contribution

• Assumptions : initially uniformly-distributed data within unit
  hypercube with independent dimensions

1.   Establish lower bounds on the average performance of NN-
     search for space- and data-partitioning, and clustering
     structures.

2.   Show formally that any partitioning scheme and clustering
     technique must degenerate to a sequential scan through all
     their blocks if the number of dimensions is sufficiently large.

3.   Present performance results which support their analysis, and
     demonstrate that the VA-file offers the best performance in
     practice whenever the number of dimensions is larger than 6.





The Difficulties of High Dimensionality
• Observation 1 (Number of partitions)
A simple partitioning scheme :
 split the data space in each dimension into two halves.




This seems reasonable with low dimensions.
But with d = 100 there are $2^{100} \approx 10^{30}$ partitions;
 even with $10^6$ points, almost all of the partitions ($10^{24}$) are empty.
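
The arithmetic behind Observation 1 can be checked with a minimal sketch. The function name `partition_stats` is hypothetical, and the only assumption is the one on the slide: one binary split per dimension. Since each point occupies at most one partition, the non-empty fraction is bounded by N divided by the number of partitions, which for d = 100 and $10^6$ points works out to roughly one partition in $10^{24}$.

```python
def partition_stats(d: int, n_points: int) -> tuple[int, float]:
    """Return (number of partitions, upper bound on the fraction of non-empty ones)."""
    partitions = 2 ** d  # one binary split per dimension
    # Each point occupies at most one partition, so at most n_points are non-empty.
    non_empty_fraction = min(1.0, n_points / partitions)
    return partitions, non_empty_fraction


if __name__ == "__main__":
    for d in (2, 10, 20, 100):
        partitions, fraction = partition_stats(d, 10**6)
        print(f"d={d:3d}  partitions={partitions:.2e}  non-empty fraction <= {fraction:.2e}")
```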





The Difficulties of High Dimensionality
• Observation 2 (Data space is sparsely populated)
 Consider a hyper-cube range query with side length s = 0.95
     Data space $\Omega = [0,1]^d$

 [Figure: target region, a hyper-cube of side length s inside the data space]

 At d = 100,
         $P^d[s] = s^d = 0.95^{100} \approx 0.0059$
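
As a quick check of Observation 2, the sketch below evaluates $s^d$ and its inverse. Both function names are hypothetical, and uniformly distributed data in $[0,1]^d$ is assumed, as in the Contribution slide.

```python
def hypercube_query_probability(s: float, d: int) -> float:
    """Probability that a uniform point in [0,1]^d falls inside a hyper-cube of side s."""
    return s ** d


def side_length_for_selectivity(p: float, d: int) -> float:
    """Side length s needed so that the query captures a fraction p of the data space."""
    return p ** (1.0 / d)


if __name__ == "__main__":
    print(hypercube_query_probability(0.95, 100))  # close to 0.0059
    print(side_length_for_selectivity(0.01, 100))  # a 1% query still spans most of each axis
```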





  The Difficulties of High Dimensionality
  • Observation 3 (Spherical range queries)
     Consider the probability that an arbitrary point R lies within the
      largest spherical query that fits entirely inside the data space.




Figure: Largest range query entirely within the data space.
Table: Probability that a point lies within the largest range query inside Ω, and the expected database size.




The Difficulties of High Dimensionality
• Observation 4 (Exponentially growing DB size)
 The size a data set would need so that, on average,
  at least one point falls into the sphere $sp^d(Q, 0.5)$ (for even d):




               Table: Probability that a point lies within the largest
              range query inside Ω, and the expected database size
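
The formulas for Observations 3 and 4 did not survive in this transcript. The sketch below therefore uses the standard volume of a d-dimensional ball, $\pi^{d/2} r^d / \Gamma(d/2 + 1)$, with r = 0.5, and assumes uniform data so that the required database size is the reciprocal of that probability. The function names are hypothetical, and the computation runs in log space to avoid overflow at large d.

```python
import math


def largest_sphere_probability(d: int) -> float:
    """Volume of the radius-0.5 ball in d dimensions, i.e. the chance that a
    uniform point of [0,1]^d lies inside the largest inscribed sphere."""
    r = 0.5
    log_volume = (d / 2) * math.log(math.pi) + d * math.log(r) - math.lgamma(d / 2 + 1)
    return math.exp(log_volume)


def required_database_size(d: int) -> float:
    """Number of uniform points needed so that, on average, one falls in the sphere."""
    return 1.0 / largest_sphere_probability(d)


if __name__ == "__main__":
    for d in (2, 4, 10, 20, 40, 100):
        print(f"d={d:3d}  P={largest_sphere_probability(d):.3e}  N={required_database_size(d):.3e}")
```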




The Difficulties of High Dimensionality
• Observation 5 (Expected NN-distance)
The probability that the NN-distance is at most r (i.e., the probability that the NN of query
   point Q is contained in $sp^d(Q, r)$):




The expected NN-distance for a query point Q :




The expected NN-distance E[nndist] for any query point in the data space :







The Difficulties of High Dimensionality
• Observation 5 (Expected NN-distance)




1.   The NN-distance grows steadily with d
2.   Beyond trivially-small data sets D, NN-distances decrease only
     marginally as the size of D increases.
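
Both points can be illustrated empirically. The sketch below is a Monte Carlo estimate, not the closed-form expectation from the slides (which is missing from this transcript). It assumes uniform data in the unit hypercube and Euclidean distance, and `mean_nn_distance` is a hypothetical helper name.

```python
import math
import random


def mean_nn_distance(n_points: int, d: int, n_queries: int = 20, seed: int = 0) -> float:
    """Estimate the expected nearest-neighbor distance by brute force."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        query = [rng.random() for _ in range(d)]
        # Brute-force scan: distance to the closest data point.
        total += min(math.dist(query, point) for point in data)
    return total / n_queries


if __name__ == "__main__":
    for d in (2, 10, 20, 50):
        print(f"N=500   d={d:2d}  mean NN-distance = {mean_nn_distance(500, d):.3f}")
    for n in (500, 1000, 2000):
        print(f"N={n:<5d} d=50  mean NN-distance = {mean_nn_distance(n, 50):.3f}")
```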

Presenter: Jungyeol Lee




Analysis of NN-Search

• The complexity of any partitioning and clustering
  scheme converges to O( N ) with increasing
  dimensionality

•   General Cost Model
•   Space-Partitioning Methods
•   Data-Partitioning Methods
•   General Partitioning and Clustering Schemes








General Cost Model

• ‘Cost’ of a query:
  – the number of blocks which must be accessed
• Optimal NN search algorithm:
  – Blocks visited during the search
      = blocks whose MBR (minimum bounding region) intersects the NN-sphere




General Cost Model

• Let $M_{visit}$ be the number of blocks visited.
• $M_{visit}$ = the number of blocks
             which intersect $sp^d(Q, E[nn^{dist}])$
• Transform the spherical query into a point query
• Minkowski sum, $MSum(mbr_i, E[nn^{dist}])$

[Figure: $mbr_i$ enlarged by $E[nn^{dist}]$ in every direction, giving $MSum(mbr_i, E[nn^{dist}])$]






General Cost Model

• Transform the spherical query into a point
  query




• Probability that the i-th block must be visited:
        $P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega)$
• $M_{visit} = \frac{N}{m} \cdot P_{visit}^{avg}$, where $P_{visit}^{avg} = \frac{m}{N} \sum_{i=0}^{N/m} P_{visit}[i]$
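
The visit probability can be estimated numerically. This is a minimal sketch rather than the authors' procedure: it assumes an axis-aligned rectangular MBR and samples query points uniformly in Ω. A point lies inside the Minkowski sum exactly when its distance to the box is at most the radius, so the hit rate estimates $Vol(MSum(mbr_i, r) \cap \Omega)$. The names `dist_to_box` and `visit_probability` are hypothetical.

```python
import math
import random


def dist_to_box(q, lo, hi):
    """Euclidean distance from point q to the axis-aligned box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def visit_probability(lo, hi, radius, n_samples=20000, seed=0):
    """Monte Carlo estimate of Vol(MSum(box, radius) intersected with [0,1]^d)."""
    rng = random.Random(seed)
    d = len(lo)
    hits = 0
    for _ in range(n_samples):
        q = [rng.random() for _ in range(d)]
        if dist_to_box(q, lo, hi) <= radius:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    # A box covering the left half of the unit square, enlarged by growing radii.
    for r in (0.0, 0.1, 0.25, 0.5):
        print(r, visit_probability([0.0, 0.0], [0.5, 1.0], r))
```

Multiplying the average of these probabilities by the number of blocks N/m gives the expected number of blocks visited.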




Space-Partitioning Methods

• Dividing $\Omega$ regardless of clusters
• If each dimension is split once,
  the total # of partitions: $2^d$, the space overhead: $O(2^d)$

• To reduce the space overhead, only $d' \le d$ dimensions
  are split such that, on average, m points are assigned
  to a partition:
      $\frac{N}{m} = 2^{d'}$,   $d' = \log_2 \frac{N}{m}$








Space-Partitioning Methods

• Let $l_{max}$ denote the maximum distance from $mbr_i$ to
  any point in the data space:
      $d' = \log_2 \frac{N}{m}$
      $l_{max} = \frac{1}{2}\sqrt{d'} = \frac{1}{2}\sqrt{\log_2 \frac{N}{m}}$

• $l_{max} < E[nn^{dist}]$ at some dimensionality
• From that dimensionality on, the Minkowski sum covers the
  entire data space
• $P_{visit}$ converges to 1, the same as a sequential scan
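
A short sketch of the two quantities on this slide. The function names are hypothetical, and d' is left unrounded; rounding it up to a whole number of split dimensions would be an extra assumption. Note that $l_{max}$ depends only on N and m, not on d, whereas the expected NN-distance grows with d, which is why the two curves eventually cross.

```python
import math


def split_dimensions(n_points: int, block_capacity: int) -> float:
    """d' = log2(N / m): dimensions to split so that each partition holds about m points."""
    return math.log2(n_points / block_capacity)


def l_max(n_points: int, block_capacity: int) -> float:
    """Maximum distance from a partition to any point of the data space: sqrt(d') / 2."""
    return 0.5 * math.sqrt(split_dimensions(n_points, block_capacity))


if __name__ == "__main__":
    n, m = 10**6, 100
    print(f"d' = {split_dimensions(n, m):.2f}, l_max = {l_max(n, m):.3f}")
```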




Space-Partitioning Methods

• $P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega) = 1$
• Fig. 7: Comparison of $l_{max}$ with $E[nn^{dist}]$




Presenter: Rina You




Data-Partitioning Methods

• Data-partitioning methods partition the data
  space hierarchically
   – In order to reduce the search cost from $O(N)$ to $O(\log N)$


• Impracticability of existing methods for NN-search
  in high-dimensional vector spaces (HDVSs):
   – A sequential scan out-performed these more sophisticated
     hierarchical methods.








Rectangular MBRs

• Index methods use hyper-cubes to bound the
  region of a block.

• Splitting a node results in two new, equally-full
  partitions of the data space.
• d’ dimensions are split at high dimensionality:

                    $d' = \log_2 \frac{N}{m}$






Rectangular MBRs

• rectangular MBR
  – d’ sides with a length of 1/2
  – d - d’ sides with a length of 1.


• the probability of visiting a block during
  NN-search
  : the volume of that part of the extended box in the data
  space








Rectangular MBRs

• the probability of accessing a block during a
  NN-search
  – different database sizes and different values of d’








Spherical MBRs

• Another group of index structures
   – MBRs in the form of hyper-spheres.

• Each block of optimal structure consists of
   – the center point C
   – m - 1 nearest neighbors


• MBR can be described by $nn^{sp,m-1}(C)$





Spherical MBRs

• The probability of accessing a block during
  the search.

• MBRs in the form of hyper-spheres: $nn^{sp,m-1}(C)$
• Use a Minkowski sum:
  $sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}])$
• The probability that block i must be visited
  during a NN-search:
  $P_{visit}^{sp}[i] = Vol(sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}]) \cap \Omega)$






Spherical MBRs

• Another lower bound for this probability:
   – replace $nn^{dist,m-1}$ by $nn^{dist,1} = E[nn^{dist}]$

   $P_{visit}^{sp}[i] \ge Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega)$

• If i increases, $nn^{dist,i}$ does not decrease:
   – $\forall j > i: nn^{dist,j} \ge nn^{dist,i}$







Spherical MBRs

• The probability of accessing a block
  during the search
  – average the above probability over all center
    points $C \in \Omega$:

    $P_{visit}^{sp,avg} \ge \int_{C \in \Omega} Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega)\, dC$








Spherical MBRs

• percentage of blocks visited increases rapidly
  with the dimensionality




• sequential scan will perform better in practice


General Partitioning and Clustering
Schemes

• No partitioning or clustering scheme
  can offer efficient NN-search
  – if the number of dimensions becomes large.


• The complexity of these methods: $O(N)$
• A large portion (up to 100%) of data
  blocks must be read
  – In order to determine the nearest neighbor.



General Partitioning and Clustering
Schemes

• Basic assumptions:
  1. A cluster is a geometrical form (MBR) that
    covers all cluster points
  2. Each cluster contains at least two points
  3. The MBR of a cluster is convex.






General Partitioning and Clustering
Schemes

• Average probability of accessing a cluster
  during an NN-search
   $P_{visit}^{avg} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i))$

   $VM(x) = Vol(MSum(x, E[nn^{dist}]) \cap \Omega)$



General Partitioning and Clustering
Schemes
• Lower bound the average probability
  of accessing a line cluster.
• Pick two arbitrary data points
  – each cluster contains at least two points
• $line(A_i, B_i)$ is contained in $mbr(C_i)$
   – $mbr(C_i)$ is convex.
• Lower bound the volume of the
  extended $mbr(C_i)$:
  $VM(mbr(C_i)) \ge VM(line(A_i, B_i))$


General Partitioning and Clustering
Schemes

• Lower bound the distance between $A_i$
  and $B_i$:
  $VM(line(A_i, B_i)) \ge VM(line(A_i, P_i)) = \min_{Q_i \in surf(nn^{sp}(A_i))} VM(line(A_i, Q_i))$
  with $P_i \in surf(nn^{sp}(A_i))$
   – Points on the surface of the NN-sphere of $A_i$ have
     the minimal Minkowski sum for $line(A_i, B_i)$
   – $line(A_i, P_i)$ is the optimal line cluster for
     point $A_i$
      • if $P_i$ is a point on the surface of the NN-sphere of $A_i$.


General Partitioning and Clustering
Schemes

• Lower bound the average probability
  of accessing a line cluster:
  $P_{visit}^{avg} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i)) \ge \int_{A \in \Omega} VM(line(A, P(A)))\, dA$

  – Calculate the average volume of minkowski
    sums over all possible pairs A and P(A) in
    the data space





General Partitioning and Clustering
Schemes

• Conclusion 1 (Performance)
  – For any clustering and partitioning method,
    a simple sequential scan performs better
    if the number of dimensions exceeds some d.


• Conclusion 2 (Complexity)
  – The complexity of any clustering and
    partitioning methods tends towards O(N)
    as dimensionality increases.


General Partitioning and Clustering
Schemes

• Conclusion 3 (Degeneration)
  – All blocks are accessed
   if the number of dimensions exceeds some d




Presenter: Kilho Lee




The VA-file

• Accelerates that unavoidable scan by using object
  approximations to compress the vector data.
• Reduces the amount of data that must be read during
  similarity searches.

• Compressing vector data
• The filtering step
• Accessing the data






The VA-file
 Compressing vector data


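  • For each dimension i, a small number of bits ($b_i$) is assigned
  • Let b be the sum of all $b_i$: $b = \sum_{i=1}^{d} b_i$
  • The data space is divided into $2^b$ cells
  • Probability that a point lies in a given cell:
       $P[\text{"in cell"}] = Vol(cell) = \left(\frac{1}{2^{b_i}}\right)^d = 2^{-b}$
  • Probability that a cell is shared with at least one other point:
       $P[\text{share}] = 1 - (1 - 2^{-b})^{N-1} \approx \frac{N}{2^b}$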
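
A minimal sketch of the compression step, under two assumptions that go beyond the slide: every dimension gets the same number of bits, and slices are equally wide (for the uniform data assumed in this deck, equal-width slices are also equally full). The names `approximate` and `share_probability` are hypothetical. The sharing probability is computed with `log1p`/`expm1` because $1 - 2^{-b}$ rounds to 1.0 in floating point for large b.

```python
import math


def approximate(vector, bits):
    """Map each coordinate in [0, 1] to the index of its slice (bits per dimension)."""
    slices = 1 << bits
    # Clamp so that a coordinate of exactly 1.0 falls into the last slice.
    return [min(int(x * slices), slices - 1) for x in vector]


def share_probability(n_points, total_bits):
    """P[share] = 1 - (1 - 2^-b)^(N-1), evaluated in a numerically stable way."""
    cell_probability = math.ldexp(1.0, -total_bits)  # 2^-b
    return -math.expm1((n_points - 1) * math.log1p(-cell_probability))


if __name__ == "__main__":
    print(approximate([0.0, 0.49, 0.999, 1.0], bits=2))
    # d = 50 dimensions with 6 bits each gives b = 300 bits per approximation.
    print(share_probability(500_000, 300), 500_000 / 2.0**300)
```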





The VA-file
 Filtering step




  • When searching for the nearest neighbor, the entire approximation file
    is scanned, and upper and lower bounds on the distance to the query
    are computed for each approximation.
  • Let δ be the smallest upper bound found so far.
  • If an approximation's lower bound exceeds δ, it is filtered out.
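
The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: approximations are lists of slice indices on an equal-width grid with the same number of bits per dimension, distances are Euclidean, and `bounds` and `filter_candidates` are hypothetical names.

```python
import math


def bounds(query, approx, bits):
    """Lower and upper bound on the distance from query to any point in the cell."""
    width = 1.0 / (1 << bits)
    lower_sq = upper_sq = 0.0
    for q, a in zip(query, approx):
        cell_lo, cell_hi = a * width, (a + 1) * width
        near = max(cell_lo - q, 0.0, q - cell_hi)  # 0 if q lies inside the slice
        far = max(q - cell_lo, cell_hi - q)        # distance to the farther slice edge
        lower_sq += near * near
        upper_sq += far * far
    return math.sqrt(lower_sq), math.sqrt(upper_sq)


def filter_candidates(query, approximations, bits):
    """Scan all approximations; keep those whose lower bound does not exceed delta."""
    delta = math.inf  # smallest upper bound found so far
    candidates = []
    for index, approx in enumerate(approximations):
        lower, upper = bounds(query, approx, bits)
        if lower > delta:
            continue  # cannot contain the nearest neighbor
        delta = min(delta, upper)
        candidates.append((lower, index))
    # delta may have shrunk after early entries were admitted, so prune once more.
    return [(lower, index) for lower, index in candidates if lower <= delta], delta


if __name__ == "__main__":
    approximations = [[0, 0], [3, 3], [1, 0]]
    print(filter_candidates([0.1, 0.1], approximations, bits=2))
```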


( Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008 )


The VA-file
 Filtering step




  • After the filtering step, less than 0.1% of the vectors remain.






The VA-file
 Accessing the vector




  • After the filtering step, a small set of candidates remains.
  • Candidates are sorted by lower bound and visited in that order.
  • If a lower bound is encountered that exceeds the nearest distance seen
    so far, the VA-file method stops.
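
The access phase described above can be sketched as a short loop. It assumes the candidates arrive as `(lower_bound, index)` pairs from the filtering step and that `vectors` is an in-memory list standing in for the vector file; `refine` is a hypothetical name. The third return value counts how many full vectors were actually read.

```python
import math


def refine(query, vectors, candidates):
    """Visit candidates in order of increasing lower bound; stop once a lower
    bound exceeds the best exact distance found so far."""
    best_distance, best_index, visited = math.inf, None, 0
    for lower, index in sorted(candidates):
        if lower > best_distance:
            break  # no remaining candidate can be closer
        visited += 1
        distance = math.dist(query, vectors[index])
        if distance < best_distance:
            best_distance, best_index = distance, index
    return best_index, best_distance, visited


if __name__ == "__main__":
    vectors = [[0.10, 0.10], [0.90, 0.90], [0.30, 0.10]]
    candidates = [(0.0, 0), (0.13, 2), (0.9, 1)]
    print(refine([0.12, 0.10], vectors, candidates))
```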




The VA-file
 Accessing the vector




  • Less than 1% of the vector blocks are visited.
  • For d = 50, $b_i$ = 6, N = 500,000, only 20 vectors are accessed.








Performance




  • The figure depicts the percentage of blocks visited.








Conclusion




  • Conventional indexing methods are out-performed by a
    simple sequential scan at moderate dimensionality (d = 10).
  • At moderate and high dimensionality (d ≥ 6), the VA-file method
    can out-perform any other method.

Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 

Dernier (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

A Quantitative analysis and performance study for similarity search methods in high-dimensional spaces

  • 1. A Quantitative analysis and performance study for similarity search methods in high-dimensional spaces. Group 4: Seokhwan Eom, Jungyeol Lee, Rina You, Kilho Lee
  • 2. Presenter: Seokhwan Eom. Contents
      • Introduction
      • Observations
      • Analysis of NN-search
      • VA-file
      • Conclusion
  • 3. The Similarity Search Paradigm (Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
  • 4. The Similarity Search Paradigm
      • Locate the closest point to the query object, i.e. its nearest neighbor (NN). (Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
  • 5. The conventional approach
      • Space-partitioning methods: Gridfile [Nievergelt:1984], K-D-B tree [Robinson:1981], Quad tree [Finkel:1974]
      • Data-partitioning index trees: R-tree [Guttman:1984], R+-tree [Sellis:1987], R*-tree [Beckmann:1990], X-tree [Berchtold:1996], SR-tree [Katayama:1997], M-tree [Ciaccia:1996], TV-tree [Lin:1994], hB-tree [Lomet:1990]
      • Unfortunately, as the number of dimensions increases, their performance degrades (the dimensional curse).
  • 6. Contribution
      • Assumptions: initially uniformly distributed data within the unit hypercube, with independent dimensions.
      • 1. Establish lower bounds on the average performance of NN-search for space-partitioning, data-partitioning, and clustering structures.
      • 2. Show formally that any partitioning scheme and clustering technique must degenerate to a sequential scan through all their blocks if the number of dimensions is sufficiently large.
      • 3. Present performance results which support their analysis, and demonstrate that the VA-file offers the best performance in practice whenever the number of dimensions is larger than 6.
  • 7. The Difficulties of High Dimensionality
      • Observation 1 (Number of partitions): A simple partitioning scheme splits the data space in each dimension into two halves. This seems reasonable with low dimensions. But with $d = 100$ there are $2^{100} \approx 10^{30}$ partitions; even with $10^{6}$ points, almost all of the partitions ($10^{24}$) are empty.
  • 8. The Difficulties of High Dimensionality
      • Observation 2 (Data space is sparsely populated): Consider a hyper-cube range query with side length $s = 0.95$ in the data space $\Omega = [0,1]^d$.
      • (Figure: target region of side $s$ inside the data space.)
      • At $d = 100$: $P^d[s] = s^d = 0.95^{100} \approx 0.0059$
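
The arithmetic behind Observations 1 and 2 can be checked with a short sketch. The function names are illustrative and not from the slides; the occupied-fraction figure is only an upper bound, using the fact that at most one partition per data point can be non-empty.

```python
def partition_stats(d: int, n_points: int) -> tuple:
    """Return (number of partitions, upper bound on the fraction that can be non-empty)."""
    partitions = 2 ** d  # one binary split per dimension
    # At most n_points partitions can contain a point, which bounds the occupied fraction.
    max_occupied_fraction = min(1.0, n_points / partitions)
    return partitions, max_occupied_fraction


def cube_query_probability(s: float, d: int) -> float:
    """Probability that a uniform point of [0,1]^d falls into a hyper-cube of side s."""
    return s ** d


for d in (2, 10, 100):
    partitions, fraction = partition_stats(d, 10 ** 6)
    print(d, partitions, fraction, cube_query_probability(0.95, d))
```

At $d = 100$ the occupied fraction is below $10^{-24}$, and a query cube that spans 95% of every axis still captures well under 1% of the data.
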
  • 9. The Difficulties of High Dimensionality
      • Observation 3 (Spherical range queries): The probability that an arbitrary point $R$ lies within the largest spherical query.
      • Figure: Largest range query entirely within the data space.
      • Table: Probability that a point lies within the largest range query inside $\Omega$, and the expected database size.
  • 10. The Difficulties of High Dimensionality
      • Observation 4 (Exponentially growing DB size): The size a data set would need to have such that, on average, at least one point falls into the sphere $sp^d(Q, 0.5)$ (for even $d$).
      • Table: Probability that a point lies within the largest range query inside $\Omega$, and the expected database size.
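
The formulas for Observations 3 and 4 did not survive in the transcript. The sketch below assumes the standard hypersphere volume $\pi^{d/2} r^d / \Gamma(d/2 + 1)$ and takes the required database size to be its reciprocal, which for even $d$ equals $(d/2)! \cdot 2^d / \pi^{d/2}$. Function names are illustrative.

```python
import math


def sphere_volume(d: int, r: float) -> float:
    """Volume of a d-dimensional hypersphere of radius r (standard Gamma-function formula)."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)


def required_db_size(d: int) -> float:
    """Number of uniform points needed so that, on average, one falls into sp^d(Q, 0.5)."""
    # The unit hypercube has volume 1, so the hit probability equals the sphere volume.
    return 1.0 / sphere_volume(d, 0.5)


for d in (2, 4, 10, 20, 40, 100):
    print(d, sphere_volume(d, 0.5), required_db_size(d))
```

The probability column shrinks faster than exponentially with $d$, so the database size needed to expect even one point inside the largest sphere quickly becomes astronomically large.
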
  • 11. The Difficulties of High Dimensionality
      • Observation 5 (Expected NN-distance)
      • The probability that the NN-distance is at most $r$ (i.e. the probability that the NN to query point $Q$ is contained in $sp^d(Q, r)$).
      • The expected NN-distance for a query point $Q$.
      • The expected NN-distance $E[nn^{dist}]$ for any query point in the data space.
  • 12. The Difficulties of High Dimensionality
      • Observation 5 (Expected NN-distance)
      • 1. The NN-distance grows steadily with $d$.
      • 2. Beyond trivially small data sets $D$, NN-distances decrease only marginally as the size of $D$ increases.
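
Both parts of Observation 5 can be illustrated empirically. This is a minimal Monte Carlo sketch rather than the closed-form expressions from the slides: it assumes uniform data in $[0,1]^d$ and Euclidean distance, and the sample sizes and seed are arbitrary choices.

```python
import math
import random


def expected_nn_distance(d: int, n_points: int, n_queries: int = 20, seed: int = 0) -> float:
    """Monte Carlo estimate of E[nn_dist] for uniform data in [0,1]^d."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        query = [rng.random() for _ in range(d)]
        # Brute-force nearest neighbor: the smallest distance over all data points.
        total += min(math.dist(query, point) for point in data)
    return total / n_queries


for d in (2, 10, 50):
    print(d, expected_nn_distance(d, 1000), expected_nn_distance(d, 4000))
```

For a fixed data-set size, the estimate rises with $d$. Quadrupling the number of points roughly halves the distance at $d = 2$, but changes it only slightly at $d = 50$.
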
  • 13. Presenter: Jungyeol Lee. Analysis of NN-Search
      • The complexity of any partitioning and clustering scheme converges to $O(N)$ with increasing dimensionality.
      • General Cost Model
      • Space-Partitioning Methods
      • Data-Partitioning Methods
      • General Partitioning and Clustering Schemes
  • 14. General Cost Model
      • "Cost" of a query: the number of blocks which must be accessed.
      • Optimal NN-search algorithm: blocks visited during the search = blocks whose MBR (minimum bounding region) intersects the NN-sphere.
  • 15. General Cost Model
      • Let $M_{visit}$ be the number of blocks visited.
      • $M_{visit}$ = the number of blocks which intersect $sp^d(Q, E[nn^{dist}])$.
      • Transform the spherical query into a point query.
      • Minkowski sum: $MSum(mbr_i, E[nn^{dist}])$. (Figure: $mbr_i$ enlarged by $E[nn^{dist}]$ to form the Minkowski sum.)
  • 16. General Cost Model
      • Transform the spherical query into a point query.
      • Probability that the $i$-th block must be visited: $P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega)$
      • $M_{visit} = \frac{N}{m} P^{avg}_{visit}$, with $P^{avg}_{visit} = \frac{m}{N} \sum_{i=0}^{N/m} P_{visit}[i]$
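
The cost model can be tried numerically for a rectangular MBR. A uniform query point lies in $MSum(mbr_i, r) \cap \Omega$ exactly when its distance to the box is at most $r$, so sampling query points estimates $P_{visit}[i]$. This is a sketch under that reading; the helper names, sample count, and example numbers are assumptions, not taken from the slides.

```python
import math
import random


def dist_to_box(x, lo, hi):
    """Euclidean distance from point x to the axis-aligned box [lo, hi] (0 if x is inside)."""
    return math.sqrt(sum(max(l - xi, 0.0, xi - h) ** 2 for xi, l, h in zip(x, lo, hi)))


def visit_probability(lo, hi, r, samples=20000, seed=0):
    """Estimate Vol(MSum(box, r) ∩ [0,1]^d) by sampling uniform query points."""
    rng = random.Random(seed)
    d = len(lo)
    hits = 0
    for _ in range(samples):
        query = [rng.random() for _ in range(d)]
        hits += dist_to_box(query, lo, hi) <= r
    return hits / samples


def expected_blocks_visited(n_points, m, p_avg):
    """M_visit = (N / m) * P_avg, with m points per block."""
    return n_points / m * p_avg


# An MBR covering the left half of the unit square, enlarged by r = 0.1.
p = visit_probability([0.0, 0.0], [0.5, 1.0], 0.1)
print(p, expected_blocks_visited(10 ** 6, 100, p))
```

For this 2-D example the enlarged box covers the strip $0 \le x \le 0.6$ of the unit square, so the estimate should land close to 0.6.
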
  • 17. Space-Partitioning Methods
      • Dividing $\Omega$ regardless of clusters.
      • If each dimension is split once, the total number of partitions is $2^d$ and the space overhead is $O(2^d)$.
      • To reduce the space overhead, only $d' \le d$ dimensions are split such that, on average, $m$ points are assigned to a partition: $2^{d'} \approx \frac{N}{m}$, i.e. $d' = \lceil \log_2 \frac{N}{m} \rceil$
  • 18. Space-Partitioning Methods
      • Let $l_{max}$ denote the maximum distance from $mbr_i$ to any point in the data space. With $d' = \lceil \log_2 \frac{N}{m} \rceil$: $l_{max} = \frac{1}{2}\sqrt{d'} = \frac{1}{2}\sqrt{\lceil \log_2 \frac{N}{m} \rceil}$
      • $l_{max} < E[nn^{dist}]$ at some dimensionality.
      • From that dimensionality on, the Minkowski sum covers the entire data space.
      • $P_{visit}$ converges to 1, the same as a sequential scan.
  • 19. Space-Partitioning Methods
      • $P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega) = 1$
      • Fig. 7: Comparison of $l_{max}$ with $E[nn^{dist}]$
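
The key point of the space-partitioning argument is that $l_{max}$ depends only on $N$ and $m$, not on $d$. A small sketch of the two formulas makes this concrete; the values $N = 10^6$ and $m = 100$ are hypothetical examples, not figures from the slides.

```python
import math


def split_dimensions(n_points: int, m: int) -> int:
    """d' = ceil(log2(N / m)): dimensions split so each partition holds about m points."""
    return math.ceil(math.log2(n_points / m))


def l_max(n_points: int, m: int) -> float:
    """Maximum distance from an MBR to any point of the data space: 0.5 * sqrt(d')."""
    return 0.5 * math.sqrt(split_dimensions(n_points, m))


n_points, m = 10 ** 6, 100  # hypothetical database size and block capacity
print(split_dimensions(n_points, m), l_max(n_points, m))
```

Because $l_{max}$ stays fixed while $E[nn^{dist}]$ keeps growing with $d$, there is some dimensionality beyond which every partition lies within the NN-sphere's reach.
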
  • 20. Presenter: Rina You. Data-Partitioning Methods
      • Data-partitioning methods partition the data space hierarchically in order to reduce the search cost from $O(N)$ to $O(\log N)$.
      • Existing methods are impractical for NN-search in high-dimensional vector spaces (HDVSs): a sequential scan out-performed these more sophisticated hierarchical methods.
  • 21. Rectangular MBRs
      • Index methods use hyper-cubes to bound the region of a block.
      • Splitting a node results in two new, equally full partitions of the data space.
      • At high dimensionality, $d'$ dimensions are split: $d' = \lceil \log_2 \frac{N}{m} \rceil$
  • 22. Rectangular MBRs
      • A rectangular MBR has $d'$ sides with a length of 1/2 and $d - d'$ sides with a length of 1.
      • The probability of visiting a block during NN-search is the volume of that part of the extended box which lies in the data space.
  • 23. Rectangular MBRs
      • The probability of accessing a block during an NN-search, for different database sizes and different values of $d'$.
  • 24. Spherical MBRs
      • Another group of index structures uses MBRs in the form of hyper-spheres.
      • Each block of the optimal structure consists of the center point $C$ and its $m - 1$ nearest neighbors.
      • The MBR can be described by $nn^{sp,m-1}(C)$.
  • 25. Spherical MBRs
      • The probability of accessing a block during the search.
      • MBRs in the form of hyper-spheres: $nn^{sp,m-1}(C)$
      • Use a Minkowski sum: $sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}])$
      • The probability that block $i$ must be visited during an NN-search: $P^{sp}_{visit}[i] = Vol(sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}]) \cap \Omega)$
  • 26. Spherical MBRs
      • Another lower bound for this probability: replace $nn^{dist,m-1}$ by $nn^{dist,1} = E[nn^{dist}]$, giving $P^{sp}_{visit}[i] \ge Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega)$
      • If $i$ increases, $nn^{dist,i}$ does not decrease: $\forall j > i: nn^{dist,j} \ge nn^{dist,i}$
  • 27. Spherical MBRs
      • The probability of accessing a block during the search: average the above probability over all center points $C \in \Omega$: $P^{sp,avg}_{visit} \ge \int_{C \in \Omega} Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega) \, dC$
  • 28. Spherical MBRs
      • The percentage of blocks visited increases rapidly with the dimensionality.
      • A sequential scan will perform better in practice.
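
The averaged lower bound for spherical MBRs can be estimated by sampling. The integral over all centers of $Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega)$ equals the probability that two independent uniform points lie within $2 \cdot E[nn^{dist}]$ of each other. The sketch relies on that identity; the value passed as `e_nn_dist` is a caller-supplied assumption, and the function name is illustrative.

```python
import math
import random


def spherical_visit_lower_bound(d, e_nn_dist, samples=20000, seed=0):
    """Estimate the average of Vol(sp^d(C, 2 * E[nn_dist]) ∩ [0,1]^d) over uniform centers C."""
    rng = random.Random(seed)
    radius = 2.0 * e_nn_dist
    hits = 0
    for _ in range(samples):
        center = [rng.random() for _ in range(d)]
        point = [rng.random() for _ in range(d)]
        # The point lies in the clipped sphere around the center iff it is within the radius.
        hits += math.dist(center, point) <= radius
    return hits / samples


# Hypothetical E[nn_dist] values, for illustration only.
for d, e_nn in ((2, 0.02), (10, 0.3), (50, 2.0)):
    print(d, spherical_visit_lower_bound(d, e_nn))
```

Feeding in larger expected NN-distances, as occur at higher dimensionality, pushes the bound towards 1. This matches the statement that the percentage of blocks visited increases rapidly with the dimensionality.
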
  • 29. General Partitioning and Clustering Schemes
      • No partitioning or clustering scheme can offer efficient NN-search if the number of dimensions becomes large.
      • The complexity of the methods: $O(N)$
      • A large portion (up to 100%) of data blocks must be read in order to determine the nearest neighbor.
  • 30. General Partitioning and Clustering Schemes
      • Basic assumptions:
      • 1. A cluster is a geometrical form (MBR) that covers all cluster points.
      • 2. Each cluster contains at least two points.
      • 3. The MBR of a cluster is convex.
  • 31. General Partitioning and Clustering Schemes
      • Average probability of accessing a cluster during an NN-search: $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i))$, with $VM(x) = Vol(MSum(x, E[nn^{dist}]) \cap \Omega)$
  • 32. General Partitioning and Clustering Schemes
      • Lower bound the average probability of accessing a line cluster.
      • Pick two arbitrary data points $A_i$, $B_i$ (each cluster contains at least two points).
      • $line(A_i, B_i)$ is contained in $mbr(C_i)$, because $mbr(C_i)$ is convex.
      • Lower bound the volume of the extended $mbr(C_i)$: $VM(mbr(C_i)) \ge VM(line(A_i, B_i))$
  • 33. General Partitioning and Clustering Schemes
      • Lower bound the distance between $A_i$ and $B_i$: $VM(line(A_i, B_i)) \ge VM(line(A_i, P_i)) \ge \min_{Q_i \in surf(nn^{sp}(A_i))} VM(line(A_i, Q_i))$, with $P_i \in surf(nn^{sp}(A_i))$
      • Points on the surface of the NN-sphere of $A_i$ have the minimal Minkowski sum for $line(A_i, B_i)$.
      • $line(A_i, P_i)$ is the optimal line cluster for point $A_i$ if $P_i$ is a point on the surface of the NN-sphere of $A_i$.
  • 34. General Partitioning and Clustering Schemes
      • Lower bound the average probability of accessing a line cluster: $P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i)) \ge \int_{A \in \Omega} VM(line(A, P(A))) \, dA$
      • Calculate the average volume of Minkowski sums over all possible pairs $A$ and $P(A)$ in the data space.
  • 35. General Partitioning and Clustering Schemes
      • Conclusion 1 (Performance): For any clustering and partitioning method, a simple sequential scan performs better if the number of dimensions exceeds some $d$.
      • Conclusion 2 (Complexity): The complexity of any clustering and partitioning method tends towards $O(N)$ as dimensionality increases.
  • 36. General Partitioning and Clustering Schemes
      • Conclusion 3 (Degeneration): All blocks are accessed if the number of dimensions exceeds some $d$.
  • 37. Presenter: Kilho Lee. The VA-file
      • Accelerates the unavoidable scan by using object approximations to compress the vector data.
      • Reduces the amount of data that must be read during similarity searches.
      • Compressing vector data
      • The filtering step
      • Accessing the data
  • 38. The VA-file: Compressing vector data
      • For each dimension $i$, a small number of bits ($b_i$) is assigned.
      • Let $b$ be the sum of all $b_i$: $b = \sum_{i=1}^{d} b_i$
      • The data space is divided into $2^b$ cells.
      • $P[\text{"in cell"}] = Vol(cell) = \left(\frac{1}{2^{b_i}}\right)^d = 2^{-b}$
      • $P[share] = 1 - (1 - 2^{-b})^{N-1} \approx \frac{N}{2^b}$
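
The compression step and the cell-sharing probability can be sketched as follows. The sketch assumes a uniform grid with $2^{b_i}$ equal-width slices per dimension on $[0,1]$. The function names and the example values ($d = 50$, $b_i = 6$, $N = 500{,}000$) are illustrative choices. `expm1` and `log1p` are used because $1 - 2^{-b}$ rounds to exactly 1.0 in floating point when $b$ is large.

```python
import math


def approximate(vector, bits_per_dim):
    """Map each coordinate in [0,1] to the index of its grid slice (bits_per_dim[i] bits each)."""
    cells = []
    for x, bits in zip(vector, bits_per_dim):
        n_slices = 1 << bits
        cells.append(min(int(x * n_slices), n_slices - 1))  # clamp x == 1.0 into the last slice
    return cells


def share_probability(n_points: int, total_bits: int) -> float:
    """P[share] = 1 - (1 - 2^-b)^(N-1), evaluated stably for very small 2^-b."""
    p_cell = 2.0 ** -total_bits
    return -math.expm1((n_points - 1) * math.log1p(-p_cell))


d, bits = 50, 6                       # example configuration
print(approximate([0.0, 0.5, 0.99, 1.0], [2, 2, 2, 2]))
print(share_probability(500_000, d * bits), 500_000 / 2.0 ** (d * bits))
```

With $b = 300$ bits the sharing probability is vanishingly small, so in this setting practically every approximation identifies a single vector.
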
  • 39. The VA-file: Filtering step
      • When searching for the nearest neighbor, the entire approximation file is scanned, and upper and lower bounds on the distance to the query are determined for each approximation.
      • Let δ be the smallest upper bound found so far.
      • If an approximation's lower bound exceeds δ, it is filtered out. (Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)
  • 40. The VA-file: Filtering step
      • After the filtering step, less than 0.1% of the vectors remain.
  • 41. The VA-file: Accessing the vector
      • After the filtering step, a small set of candidates remains.
      • Candidates are sorted by lower bound.
      • If a lower bound is encountered that exceeds the nearest distance seen so far, the VA-file method stops.
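
The two phases described above can be put together in a minimal in-memory sketch. It assumes a uniform grid with the same number of bits in every dimension and Euclidean distance. `quantize`, `cell_bounds`, and `va_file_nn` are hypothetical names, and a real VA-file would scan a packed approximation file on disk rather than Python lists.

```python
import math
import random


def quantize(vector, bits):
    """Approximate a vector in [0,1]^d by its grid-cell indices (same bit count per dimension)."""
    n = 1 << bits
    return [min(int(x * n), n - 1) for x in vector]


def cell_bounds(query, cell, bits):
    """Lower and upper bounds on the distance from query to any point inside the cell."""
    width = 1.0 / (1 << bits)
    lb2 = ub2 = 0.0
    for q, c in zip(query, cell):
        lo, hi = c * width, (c + 1) * width
        near = max(lo - q, 0.0, q - hi)  # gap to the closest face (0 if q is inside the slice)
        far = max(q - lo, hi - q)        # gap to the farthest face
        lb2 += near * near
        ub2 += far * far
    return math.sqrt(lb2), math.sqrt(ub2)


def va_file_nn(vectors, approximations, query, bits):
    # Phase 1 (filtering): scan every approximation; delta is the smallest upper bound so far.
    delta = math.inf
    candidates = []
    for idx, cell in enumerate(approximations):
        lb, ub = cell_bounds(query, cell, bits)
        if lb <= delta:
            candidates.append((lb, idx))
            delta = min(delta, ub)
    # Phase 2 (accessing vectors): visit candidates in order of increasing lower bound.
    best_dist, best_idx, visited = math.inf, -1, 0
    for lb, idx in sorted(candidates):
        if lb > best_dist:
            break  # no remaining candidate can be closer than the best vector seen
        visited += 1
        dist = math.dist(query, vectors[idx])
        if dist < best_dist:
            best_dist, best_idx = dist, idx
    return best_idx, best_dist, visited


rng = random.Random(0)
data = [[rng.random() for _ in range(16)] for _ in range(2000)]
approx = [quantize(v, 4) for v in data]
print(va_file_nn(data, approx, [rng.random() for _ in range(16)], 4))
```

The true nearest neighbor always survives both phases, because its lower bound never exceeds its actual distance. The third return value counts how many full vectors had to be read.
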
  • 42. The VA-file: Accessing the vector
      • Less than 1% of vector blocks are visited.
      • For $d = 50$, $b_i = 6$, $N = 500{,}000$, only 20 vectors are accessed.
  • 43. Performance
      • The figure depicts the percentage of blocks visited.
  • 44. Conclusion
      • Conventional indexing methods are out-performed by a simple sequential scan at moderate dimensionality ($d = 10$).
      • At moderate and high dimensionality ($d \ge 6$), the VA-file method can out-perform any other method.