A Quantitative analysis and performance study for similarity search methods in high-dimensional spaces

Group 4
Seokhwan Eom,
Jungyeol Lee,
Rina You,
Kilho Lee,

Presenter: Seokhwan Eom

Contents

• Introduction
• Observations
• Analysis of NN-search
• VA-file
• Conclusion



The Similarity Search Paradigm

(Reference: What’s wrong with high-dimensional similarity search?, S. Blott, VLDB 2008)


The Similarity Search Paradigm

Locate the closest point to the query object, i.e. its nearest neighbor (NN).



The conventional approach

• Space-partitioning methods
- Gridfile [Nievergelt:1984]
- K-D-B tree [Robinson:1981]
- Quad tree [Finkel:1974]

• Data-partitioning index trees
- R-tree [Guttman:1984]
- R+-tree [Sellis:1987]
- R*-tree [Beckmann:1990]
- X-tree [Berchtold:1996]
- SR-tree [Katayama:1997]
- M-tree [Ciaccia:1996]
- TV-tree [Lin:1994]
- hB-tree [Lomet:1990]

Unfortunately, as the number of dimensions increases, their performance degrades.
- The "dimensional curse"



Contribution

• Initial assumptions: uniformly distributed data within the unit
hypercube, with independent dimensions

1. Establish lower bounds on the average performance of NN-search
for space-partitioning, data-partitioning, and clustering
structures.

2. Show formally that any partitioning scheme or clustering
technique must degenerate to a sequential scan through all
of its blocks if the number of dimensions is sufficiently large.

3. Present performance results that support the analysis and
demonstrate that the VA-file offers the best performance in
practice whenever the number of dimensions is larger than 6.



The Difficulties of High Dimensionality
• Observation 1 (Number of partitions)
A simple partitioning scheme:
split the data space in each dimension into two halves.

This seems reasonable in low dimensions.
But with d = 100 there are $2^{100} \approx 10^{30}$ partitions;
even with $10^6$ points, almost all of the partitions ($10^{24}$) are empty.
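
The arithmetic behind Observation 1 can be checked with a few lines of Python. This is only an illustrative sketch; the function names `partition_count` and `max_occupied_fraction` are made up for this example and do not come from the paper.

```python
def partition_count(d: int) -> int:
    """Halving every one of the d dimensions once yields 2**d cells."""
    return 2 ** d


def max_occupied_fraction(d: int, n_points: int) -> float:
    """Upper bound on the fraction of non-empty cells.

    Each point lies in at most one cell, so at most n_points cells
    can be occupied.
    """
    return min(1.0, n_points / partition_count(d))


for d in (2, 10, 20, 100):
    print(d, partition_count(d), max_occupied_fraction(d, 10**6))
```

Already at d = 100 the bound on the occupied fraction is far below any useful level, which is why almost every partition is empty.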



• Observation 2 (Data space is sparsely populated)
Consider a hyper-cube range query with side length s = 0.95
in the data space $\Omega = [0,1]^d$.

Figure: target region of side length s inside the data space.

At d = 100,
$P^d[s] = s^d = 0.95^{100} \approx 0.0059$
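
A short sketch of the same calculation, assuming uniformly distributed data as on the slide. The helper `side_for_probability` is an addition for illustration only: it inverts the formula to show how long the query side must be to capture a given fraction of the data.

```python
def hypercube_query_probability(s: float, d: int) -> float:
    """Probability that a uniform point falls into a cube of side s in [0,1]^d."""
    return s ** d


def side_for_probability(p: float, d: int) -> float:
    """Side length needed so that the cube captures a fraction p of the data."""
    return p ** (1.0 / d)


for d in (2, 10, 50, 100):
    print(d, hypercube_query_probability(0.95, d), side_for_probability(0.01, d))
```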



• Observation 3 (Spherical range queries)
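The probability that an arbitrary point R lies within the largest
spherical query that fits entirely inside the data space:
$P[R \in sp^d(Q, 0.5)] = Vol(sp^d(Q, 0.5)) = \frac{\sqrt{\pi^d} \cdot 0.5^d}{\Gamma(d/2 + 1)}$

Figure: Largest range query entirely within the data space.
Table: Probability that a point lies within the largest
range query inside Ω, and the expected database size.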
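
The probability in Observation 3 is the volume of a d-dimensional ball of radius 0.5, since the data space has volume 1. The following sketch evaluates the standard ball-volume formula; `ball_volume` is a name chosen for this example.

```python
import math


def ball_volume(d: int, r: float) -> float:
    """Volume of a d-dimensional ball: pi^(d/2) * r^d / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)


# Probability that a uniform point lies in the largest sphere inside [0,1]^d.
for d in (2, 4, 10, 20, 40, 100):
    print(d, ball_volume(d, 0.5))
```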



• Observation 4 (Exponentially growing DB size)
The size a data set would need so that, on average,
at least one point falls into the sphere $sp^d(Q, 0.5)$ (for even d):
$N(d) = \frac{1}{Vol(sp^d(Q, 0.5))} = \frac{(d/2)!}{\sqrt{\pi^d} \cdot 0.5^d}$

Table: Probability that a point lies within the largest
range query inside Ω, and the expected database size
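
A minimal sketch of the database size needed for Observation 4, assuming it is simply the reciprocal of the sphere volume (one expected hit). The function name `required_database_size` is hypothetical; for even d the Gamma function reduces to a factorial.

```python
import math


def required_database_size(d: int) -> float:
    """Number of uniform points needed so that, on average, one point
    falls into a sphere of radius 0.5 (even d only)."""
    if d % 2 != 0:
        raise ValueError("this closed form is stated for even d only")
    volume = math.pi ** (d // 2) * 0.5 ** d / math.factorial(d // 2)
    return 1.0 / volume


for d in (2, 4, 10, 20, 40, 100):
    print(d, required_database_size(d))
```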



• Observation 5 (Expected NN-distance)
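The probability that the NN-distance is at most r (i.e. the probability that the NN of query
point Q is contained in $sp^d(Q, r)$):
$P[Q, r] = 1 - (1 - Vol(sp^d(Q, r) \cap \Omega))^N$

The expected NN-distance for a query point Q:
$E[Q, nn^{dist}] = \int_0^\infty r \cdot \frac{\partial P[Q, r]}{\partial r}\, dr$

The expected NN-distance $E[nn^{dist}]$ for any query point in the data space:
$E[nn^{dist}] = \int_{Q \in \Omega} E[Q, nn^{dist}]\, dQ$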



• Observation 5 (Expected NN-distance)

1. The NN-distance grows steadily with d
2. Beyond trivially-small data sets D, NN-distances decrease only
marginally as the size of D increases.
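
Both effects can be reproduced with a small simulation. This is a rough Monte Carlo sketch under the uniform-data assumption, not the analytical evaluation used in the paper; `mean_nn_distance` and its parameters are invented for this example.

```python
import math
import random


def mean_nn_distance(d, n_points, n_queries=20, seed=0):
    """Average Euclidean NN-distance of random queries against
    n_points uniform points in [0,1]^d (brute-force search)."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        query = [rng.random() for _ in range(d)]
        total += min(math.dist(query, p) for p in points)
    return total / n_queries


# 1. The NN-distance grows with d (fixed database size).
for d in (2, 10, 50):
    print("d =", d, mean_nn_distance(d, 1000))

# 2. At d = 50, a ten times larger database shrinks it only slightly.
for n in (100, 1000):
    print("N =", n, mean_nn_distance(50, n))
```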


Presenter: Jungyeol Lee

Analysis of NN-Search

• The complexity of any partitioning and clustering
scheme converges to O(N) with increasing
dimensionality

• General Cost Model
• Space-Partitioning Methods
• Data-Partitioning Methods
• General Partitioning and Clustering Schemes



General Cost Model

• ‘Cost’ of a query:
– the number of blocks that must be accessed
• Optimal NN-search algorithm:
– blocks visited during the search
= blocks whose MBR (minimum bounding region) intersects the NN-sphere


General Cost Model

• Let $M_{visit}$ be the number of blocks visited.
• $M_{visit}$ = the number of blocks
that intersect $sp^d(Q, E[nn^{dist}])$
• Transform the spherical query into a point query
• Minkowski sum: $MSum(mbr_i, E[nn^{dist}])$

Figure: $mbr_i$ enlarged by $E[nn^{dist}]$ in every direction, giving $MSum(mbr_i, E[nn^{dist}])$.


General Cost Model

• Transform the spherical query into a point
query

• Probability that the i-th block must be visited:
$P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega)$

• $M_{visit} = \frac{N}{m} \cdot P^{avg}_{visit}$, where $P^{avg}_{visit} = \frac{m}{N} \sum_{i=0}^{N/m} P_{visit}[i]$
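
The point-query transformation can be illustrated as follows: a query point lies in the Minkowski sum of a box and a sphere of radius r exactly when its distance to the box is at most r. The sketch below estimates the visit probability of one block by sampling uniform query points. All function names are hypothetical, and the Monte Carlo estimate stands in for the exact volume computation.

```python
import math
import random


def distance_to_box(point, low, high):
    """Euclidean distance from a point to an axis-aligned box (0 if inside)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(point, low, high)))


def in_minkowski_sum(point, low, high, radius):
    """True if the point lies in MSum(box, radius)."""
    return distance_to_box(point, low, high) <= radius


def estimate_p_visit(low, high, radius, samples=20000, seed=0):
    """Monte Carlo estimate of Vol(MSum(box, radius) intersected with [0,1]^d)."""
    rng = random.Random(seed)
    d = len(low)
    hits = sum(in_minkowski_sum([rng.random() for _ in range(d)], low, high, radius)
               for _ in range(samples))
    return hits / samples


# 2-d example: box [0, 0.5]^2 enlarged by 0.1; the exact area is
# 0.25 + 2 * (0.5 * 0.1) + pi * 0.1**2 / 4.
print(estimate_p_visit([0.0, 0.0], [0.5, 0.5], 0.1))
```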


Space-Partitioning Methods

• Divide $\Omega$ regardless of clusters
• If each dimension is split once,
the total number of partitions is $2^d$ and the space overhead is $O(2^d)$

• To reduce the space overhead, only $d' \le d$ dimensions
are split such that, on average, m points are assigned
to a partition:
$2^{d'} \approx \frac{N}{m}$, i.e. $d' = \left\lceil \log_2 \frac{N}{m} \right\rceil$




• Let $l_{max}$ denote the maximum distance from $mbr_i$ to
any point in the data space:
$d' = \left\lceil \log_2 \frac{N}{m} \right\rceil$

$l_{max} = \frac{1}{2}\sqrt{d'} = \frac{1}{2}\sqrt{\left\lceil \log_2 \frac{N}{m} \right\rceil}$

• $l_{max} \le E[nn^{dist}]$ from some dimensionality onward
• From that dimensionality on, the Minkowski sum covers the
entire data space
• $P_{visit}$ converges to 1, the same as a sequential scan
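
A small sketch of how d' and the maximum distance $l_{max}$ follow from N and m. It assumes d' is rounded up to a whole number of dimensions; the function names and the example values N = 1,000,000 and m = 100 are illustrative only. Note that $l_{max}$ depends only on N and m, not on d, while the expected NN-distance keeps growing with d.

```python
import math


def split_dimensions(n_points: int, block_capacity: int) -> int:
    """Number of dimensions d' that are split once so that each
    partition holds about block_capacity points."""
    return math.ceil(math.log2(n_points / block_capacity))


def l_max(n_points: int, block_capacity: int) -> float:
    """Largest distance from a partition to any point of the data space:
    each split side has length 1/2, so l_max = 0.5 * sqrt(d')."""
    return 0.5 * math.sqrt(split_dimensions(n_points, block_capacity))


print(split_dimensions(10**6, 100), l_max(10**6, 100))
```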



• $P_{visit}[i] = Vol(MSum(mbr_i, E[nn^{dist}]) \cap \Omega) = 1$
• Fig. 7: Comparison of $l_{max}$ with $E[nn^{dist}]$


Presenter: Rina You

Data-Partitioning Methods

• Data-partitioning methods partition the data
space hierarchically
– in order to reduce the search cost from $O(N)$ to $O(\log N)$

• Existing methods are impractical for NN-search
in high-dimensional vector spaces (HDVSs).
– A sequential scan out-performed these more sophisticated
hierarchical methods.


Rectangular MBRs

• Index methods use hyper-cubes to bound the
region of a block.

• Splitting a node results in two new, equally full
partitions of the data space.
• At high dimensionality, only d' dimensions are split:

$d' = \left\lceil \log_2 \frac{N}{m} \right\rceil$


Rectangular MBRs

• A rectangular MBR has
– d' sides with a length of 1/2
– d - d' sides with a length of 1

• The probability of visiting a block during
NN-search is the volume of the part of the extended box
(the Minkowski sum of the MBR) that lies within the data space
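
For this special MBR shape the volume of the extended box can be estimated by sampling. In the sketch below only the d' split dimensions are sampled, because along a full-length side a query coordinate is always inside the MBR and contributes nothing to the distance. The function name `visit_probability` and the sampled radii are assumptions for illustration.

```python
import math
import random


def visit_probability(d_split, radius, samples=20000, seed=0):
    """Monte Carlo estimate of the visit probability for an MBR with
    d_split sides [0, 0.5] and all remaining sides [0, 1], enlarged by radius."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Squared distance from a uniform query point to the MBR.
        dist_sq = sum(max(rng.random() - 0.5, 0.0) ** 2 for _ in range(d_split))
        if math.sqrt(dist_sq) <= radius:
            hits += 1
    return hits / samples


# With d' = 14 the block is visited with certainty once the radius
# reaches 0.5 * sqrt(14), the maximum distance to the MBR.
for r in (0.2, 0.5, 0.8, 1.2, 0.5 * math.sqrt(14)):
    print(round(r, 3), visit_probability(14, r))
```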


Rectangular MBRs

• The probability of accessing a block during an
NN-search
– for different database sizes and different values of d'


Spherical MBRs

• Another group of index structures uses
– MBRs in the form of hyper-spheres.

• Each block of the optimal structure consists of
– the center point C
– its m - 1 nearest neighbors

• The MBR can be described by $nn^{sp,m-1}(C)$


Spherical MBRs

• The probability of accessing a block during
the search.

• MBRs in the form of hyper-spheres: $nn^{sp,m-1}(C)$
• Use a Minkowski sum:
$sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}])$

• The probability that block i must be visited
during an NN-search:
$P^{sp}_{visit}[i] = Vol(sp^d(C, nn^{dist,m-1}(C) + E[nn^{dist}]) \cap \Omega)$


Spherical MBRs

• Another lower bound for this probability
– replace $nn^{dist,m-1}$ by $nn^{dist,1} = E[nn^{dist}]$

$P^{sp}_{visit}[i] \ge Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega)$

• If i increases, $nn^{dist,i}$ does not decrease:
– $\forall j > i: nn^{dist,j} \ge nn^{dist,i}$


Spherical MBRs

• The probability of accessing a block
during the search
– average the above probability over all center
points $C \in \Omega$:

$P^{sp,avg}_{visit} \ge \int_{C \in \Omega} Vol(sp^d(C, 2 \cdot E[nn^{dist}]) \cap \Omega)\, dC$
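
The averaged lower bound can be approximated by sampling the center C and a test point jointly: for uniform data, the volume of the sphere inside the data space equals the probability that a uniform point lies within 2·E[nn^dist] of C. A hedged sketch, with `avg_sphere_visit_bound` as a made-up name and the expected NN-distance passed in as a plain number:

```python
import math
import random


def avg_sphere_visit_bound(d, expected_nn_dist, samples=20000, seed=0):
    """Monte Carlo estimate of the average volume of
    sp(C, 2 * E[nn_dist]) inside [0,1]^d over uniform centers C."""
    rng = random.Random(seed)
    radius = 2.0 * expected_nn_dist
    hits = 0
    for _ in range(samples):
        center = [rng.random() for _ in range(d)]
        point = [rng.random() for _ in range(d)]
        if math.dist(center, point) <= radius:
            hits += 1
    return hits / samples


# In one dimension with radius 0.5 the exact value is 2*0.5 - 0.5**2 = 0.75.
print(avg_sphere_visit_bound(1, 0.25))
```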


Spherical MBRs

• The percentage of blocks visited increases rapidly
with the dimensionality

• A sequential scan will perform better in practice

General Partitioning and Clustering Schemes

• No partitioning or clustering scheme
can offer efficient NN-search
– if the number of dimensions becomes large.

• The complexity of these methods: $O(N)$
• A large portion (up to 100%) of the data
blocks must be read
– in order to determine the nearest neighbor.

General Partitioning and Clustering Schemes

• Basic assumptions:
1. A cluster is a geometrical form (MBR) that
covers all cluster points
2. Each cluster contains at least two points
3. The MBR of a cluster is convex.

General Partitioning and Clustering Schemes

• Average probability of accessing a cluster
during an NN-search:
$P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i))$

$VM(x) = Vol(MSum(x, E[nn^{dist}]) \cap \Omega)$

General Partitioning and Clustering Schemes
• Lower bound the average probability
of accessing a line cluster.
• Pick two arbitrary data points $A_i$, $B_i$ of cluster $C_i$
– each cluster contains at least two points
• $line(A_i, B_i)$ is contained in $mbr(C_i)$
– $mbr(C_i)$ is convex.
• Lower bound the volume of the
extended $mbr(C_i)$:
$VM(mbr(C_i)) \ge VM(line(A_i, B_i))$

General Partitioning and Clustering Schemes

• Lower bound the distance between $A_i$
and $B_i$:
$VM(line(A_i, B_i)) \ge VM(line(A_i, P_i)) = \min_{Q_i \in surf(nn^{sp}(A_i))} VM(line(A_i, Q_i))$
with $P_i \in surf(nn^{sp}(A_i))$
– Points on the surface of the NN-sphere of $A_i$ have
the minimal Minkowski sum for $line(A_i, B_i)$
– $line(A_i, P_i)$ is the optimal line cluster for
point $A_i$ if $P_i$ is a point on the surface of the NN-sphere of $A_i$.

General Partitioning and Clustering Schemes

• Lower bound the average probability
of accessing a line cluster:
$P^{avg}_{visit} = \frac{1}{l} \sum_{i=1}^{l} VM(mbr(C_i)) \ge \int_{A \in \Omega} VM(line(A, P(A)))\, dA$

– Calculate the average volume of the Minkowski
sums over all possible pairs A and P(A) in
the data space

General Partitioning and Clustering Schemes

• Conclusion 1 (Performance)
– For any clustering and partitioning method,
a simple sequential scan performs better
if the number of dimensions exceeds some d.

• Conclusion 2 (Complexity)
– The complexity of any clustering and
partitioning method tends towards O(N)
as dimensionality increases.

General Partitioning and Clustering Schemes

• Conclusion 3 (Degeneration)
– All blocks are accessed
if the number of dimensions exceeds some d


Presenter: Kilho Lee

The VA-file

• Accelerates that unavoidable scan by using object
approximations to compress the vector data.
• Reduces the amount of data that must be read during
similarity searches.

• Compressing vector data
• The filtering step
• Accessing the data



The VA-file
Compressing vector data

• For each dimension i, a small number of bits ($b_i$) is assigned
• Let b be the sum of all $b_i$: $b = \sum_{i=1}^{d} b_i$
• The data space is divided into $2^b$ cells

$P[\text{"in cell"}] = Vol(cell) = \left(\frac{1}{2^{b_i}}\right)^d = 2^{-b}$

$P[share] = 1 - (1 - 2^{-b})^{N-1} \approx \frac{N}{2^b}$
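
A minimal sketch of the compression step. It assumes a uniform grid on [0,1] with the same number of bits in every dimension, which is a simplification for uniform data; `approximate`, `cell_bounds`, and `sharing_probability` are illustrative names, not an API from the paper.

```python
import math


def approximate(vector, bits_per_dim):
    """Replace each coordinate in [0, 1] by the index of its grid cell."""
    cells = 1 << bits_per_dim
    return [min(int(x * cells), cells - 1) for x in vector]


def cell_bounds(cell_index, bits_per_dim):
    """Lower and upper boundary of a cell along one dimension."""
    cells = 1 << bits_per_dim
    return cell_index / cells, (cell_index + 1) / cells


def sharing_probability(total_bits, n_points):
    """P[share] = 1 - (1 - 2^-b)^(N-1), evaluated with log1p/expm1
    so that very small probabilities do not round to zero."""
    p_cell = 2.0 ** -total_bits
    return -math.expm1((n_points - 1) * math.log1p(-p_cell))


print(approximate([0.10, 0.52, 0.97], 2))   # cell indices per dimension
print(sharing_probability(50 * 6, 500_000))  # d = 50, 6 bits per dimension
```

With d = 50 and 6 bits per dimension the chance that two vectors share a cell is negligible, so each approximation practically identifies one vector.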



The VA-file
Filtering step

• When searching for the nearest neighbor, the entire approximation file
is scanned, and upper and lower bounds on the distance to the query
are computed for each approximation.
• Let δ be the smallest upper bound found so far.
• If an approximation's lower bound exceeds δ, the vector is filtered out.
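
The filtering step might look like the following sketch for a single nearest neighbor. It assumes Euclidean distance and a uniform grid with the same number of bits per dimension; the names `bounds` and `filter_candidates` are hypothetical, and the final pruning pass is an extra convenience that is not described on the slides.

```python
import math


def bounds(query, approx, bits_per_dim):
    """Lower and upper bound on the distance between the query and any
    vector inside the cell described by approx."""
    cells = 1 << bits_per_dim
    lower_sq = upper_sq = 0.0
    for q, a in zip(query, approx):
        low, high = a / cells, (a + 1) / cells
        nearest = max(low - q, 0.0, q - high)   # 0 if q is inside the cell
        farthest = max(q - low, high - q)
        lower_sq += nearest ** 2
        upper_sq += farthest ** 2
    return math.sqrt(lower_sq), math.sqrt(upper_sq)


def filter_candidates(query, approximations, bits_per_dim):
    """Scan all approximations once; keep those whose lower bound does
    not exceed delta, the smallest upper bound seen so far."""
    delta = math.inf
    candidates = []
    for index, approx in enumerate(approximations):
        lower, upper = bounds(query, approx, bits_per_dim)
        if lower > delta:
            continue                      # filtered out
        delta = min(delta, upper)
        candidates.append((lower, index))
    # Extra pass (assumption): drop early candidates that a later,
    # smaller delta has made obsolete.
    return [(l, i) for l, i in candidates if l <= delta], delta


cands, delta = filter_candidates([0.1, 0.1], [[0, 0], [3, 3], [1, 0]], 2)
print(cands, delta)
```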



The VA-file
Filtering step

• After the filtering step, fewer than 0.1% of the vectors remain.



The VA-file
Accessing the vector

• After the filtering step, a small set of candidates remains.
• Candidates are sorted by lower bound and visited in increasing order.
• If a lower bound is encountered that exceeds the nearest distance seen
so far, the VA-file method stops.
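
The second phase can be sketched as below. It assumes the candidates arrive as (lower bound, vector index) pairs from the filtering step and that `vectors` gives access to the exact vectors; the function name `refine` and the sample data are invented for illustration.

```python
import math


def refine(query, candidates, vectors):
    """Visit candidates in order of increasing lower bound and stop as
    soon as a lower bound exceeds the best exact distance found so far."""
    best_index, best_dist = None, math.inf
    visited = 0
    for lower, index in sorted(candidates):
        if lower > best_dist:
            break                          # no remaining candidate can be closer
        visited += 1
        dist = math.dist(query, vectors[index])
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist, visited


vectors = {0: [0.2, 0.2], 2: [0.3, 0.1], 5: [0.9, 0.9]}
candidates = [(0.15, 2), (0.0, 0), (0.5, 5)]
print(refine([0.1, 0.1], candidates, vectors))
```

In this example only one exact vector is read: the second candidate's lower bound already exceeds the distance to the first.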



The VA-file
Accessing the vector

• Fewer than 1% of the vector blocks are visited.
• For d = 50, $b_i$ = 6, N = 500,000, only 20 vectors are accessed.



Performance

• The figure depicts the percentage of blocks visited.



Conclusion

• Conventional indexing methods are out-performed by a
simple sequential scan at moderate dimensionality (d = 10).
• At moderate and high dimensionality (d ≥ 6), the VA-file method
can out-perform any other method.
