2. Problem Statement
• Given a query point q,
• find the closest items to the query point with probability 1 − δ
• Iterative methods?
• Large volume of data
• Curse of dimensionality
3. Taxonomy – Near Neighbor Query (NN)
NN
• Trees: k-d tree, range tree, B-tree, cover tree
• Grid
• Voronoi diagram
• Hash
• Approximate: LSH
4. Approximate LSH
• Simple idea: if two points are close together, then after a “projection” operation these two points will remain close together
5. LSH Requirement
• For any given points p, q ∈ R^d:
• P_H[h(p) = h(q)] ≥ P1 for ‖p − q‖ ≤ d1
• P_H[h(p) = h(q)] ≤ P2 for ‖p − q‖ ≥ c·d1 = d2
• A hash function h satisfying this is (d1, d2, P1, P2)-sensitive; ideally we need
• (P1 − P2) to be large
• (d2 − d1) to be small
8. Hash Function (Random)
• Locality-preserving
• Independent
• Deterministic
• Family of Hash Function per various distance measures
• Euclidean
• Jaccard
• Cosine Similarity
• Hamming
9. LSH Family for Euclidean distance (2d)
• Project points onto a random line cut into buckets of width a; two points at distance d can share a bucket only when d·cos θ ≤ a (θ: angle between the random line and the line joining the points)
• Chance of colliding
• But not certain
• But we can guarantee:
• If d ≤ a/2,
• then d·cos θ ≤ a/2 for every θ, so the chance of sharing a bucket is at least 1/2
• ∴ P1 ≥ 1/2
• If d ≥ 2a,
• collision requires cos θ ≤ 1/2, i.e., 60° ≤ θ ≤ 90°, which happens with probability at most 1/3
• ∴ P2 ≤ 1/3
• As LSH: (d1, d2, P1, P2) = (a/2, 2a, 1/2, 1/3)-sensitive
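The (a/2, 2a, 1/2, 1/3) bounds above can be checked empirically. A minimal Monte Carlo sketch (not from the slides; function names and parameters are illustrative), projecting 2-d points onto a random line with bucket width a and a random offset:

```python
import math
import random

def same_bucket(p, q, a, rng):
    """Project p and q onto a random line through the origin and compare
    buckets of width a, shifted by a random offset b in [0, a)."""
    theta = rng.uniform(0, math.pi)            # random line direction
    ux, uy = math.cos(theta), math.sin(theta)
    b = rng.uniform(0, a)                      # random bucket offset
    s = math.floor((p[0] * ux + p[1] * uy + b) / a)
    t = math.floor((q[0] * ux + q[1] * uy + b) / a)
    return s == t

def collision_rate(dist, a, trials=20000, seed=0):
    """Estimate the collision probability for point pairs at distance `dist`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        phi = rng.uniform(0, 2 * math.pi)      # random orientation of the pair
        p = (0.0, 0.0)
        q = (dist * math.cos(phi), dist * math.sin(phi))
        if same_bucket(p, q, a, rng):
            hits += 1
    return hits / trials

a = 1.0
p1 = collision_rate(a / 2, a)   # points at distance a/2
p2 = collision_rate(2 * a, a)   # points at distance 2a
print(p1, p2)                   # expect p1 >= 1/2 and p2 <= 1/3
```

The estimates comfortably satisfy the guaranteed bounds because the bounds are deliberately loose.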
10. How to define the projection?
• Scalar projection (dot product):
h(v) = v · x
v - query point in d-dimensional space
x - vector with random components drawn from N(0, 1)
• Quantized projection:
h(v) = ⌊(v · x + b) / w⌋
w - width of the quantization bin
b - random variable uniformly distributed between 0 and w
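The quantized projection above can be sketched in a few lines of Python (a hypothetical helper, assuming Gaussian components for x as the slide states):

```python
import math
import random

def make_hash(d, w, seed=None):
    """Build one LSH function h(v) = floor((v . x + b) / w), with x having
    i.i.d. N(0, 1) components and b drawn uniformly from [0, w)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)

    def h(v):
        return math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / w)

    return h

h = make_hash(d=3, w=4.0, seed=42)
print(h((1.0, 2.0, 3.0)), h((1.1, 2.1, 2.9)))  # nearby points often share a bin
```

With the seed fixed, the hash is deterministic, matching the "Deterministic" requirement on slide 8.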
11. How to define the projection?
• Use k dot products, so that (P1/P2)^k > (P1/P2):
• fewer points at different separations will fall into the same quantization bin
• Perform k independent dot products
• Achieve success,
• if the query and the nearest neighbor are in the same bin in all k dot products
• Success probability = P1^k; decreases as we include more dot products
12. Multiple-projections
• L independent projections
• The true near neighbor is unlikely to be unlucky in all L projections
• By increasing L,
• we can find the true nearest neighbor with arbitrarily high probability
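Slides 11 and 12 together describe a table keyed by k dot products, repeated L times. A minimal sketch of such an index (class and method names are illustrative, not from the slides):

```python
import math
import random

class LSHIndex:
    """Sketch of an LSH index: L tables, each keyed by k quantized projections."""

    def __init__(self, d, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.funcs = []   # per table: k pairs of (random vector x, offset b)
        self.tables = []  # per table: dict mapping k-tuple key -> list of points
        for _ in range(L):
            fs = [([rng.gauss(0, 1) for _ in range(d)], rng.uniform(0, w))
                  for _ in range(k)]
            self.funcs.append(fs)
            self.tables.append({})

    def _key(self, fs, v):
        # k-tuple of quantized dot products, floor((v . x + b) / w)
        return tuple(math.floor((sum(a * b for a, b in zip(v, x)) + b0) / self.w)
                     for x, b0 in fs)

    def add(self, v):
        for fs, table in zip(self.funcs, self.tables):
            table.setdefault(self._key(fs, v), []).append(v)

    def query(self, q):
        # union of candidates colliding with q in at least one of the L tables
        cands = set()
        for fs, table in zip(self.funcs, self.tables):
            cands.update(table.get(self._key(fs, q), []))
        return cands

idx = LSHIndex(d=2, k=4, L=8, w=4.0, seed=0)
idx.add((0.1, 0.1))
idx.add((10.0, 10.0))
print(idx.query((0.0, 0.0)))  # nearby point should appear with high probability
```

A candidate set returned by `query` would still be scanned linearly to pick the true nearest neighbor, which is the T_c term on slide 14.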
13. Accuracy
• Two close points p and q,
• separated by u = ‖p − q‖
• Probability of collision P_H(u):
P_H(u) = Pr_H[H(p) = H(q)] = ∫₀^w (1/u) · f_s(t/u) · (1 − t/w) dt
f_s - probability density function of the random projection
• As the distance u increases, P_H(u) decreases
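The integral above can be evaluated numerically to confirm that P_H(u) falls as u grows. A sketch assuming Gaussian projections, so that f_s is taken as the density of |N(0, 1)| (an assumption; the slides only name f_s as the projection's density):

```python
import math

def f_abs_gauss(x):
    """Density of |N(0,1)| (assumes Gaussian random projections)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)

def collision_prob(u, w, steps=10000):
    """P_H(u) = integral_0^w (1/u) f_s(t/u) (1 - t/w) dt, by the trapezoid rule."""
    h = w / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        val = (1.0 / u) * f_abs_gauss(t / u) * (1.0 - t / w)
        total += val * (h if 0 < i < steps else h / 2.0)
    return total

w = 4.0
ps = [collision_prob(u, w) for u in (0.5, 1.0, 2.0, 4.0)]
print(ps)  # strictly decreasing in u
```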
14. Time complexity
• For a query point q,
• time to find the near neighbor: T_g + T_c
• Calculate & hash the projections (T_g)
• O(DkL); D - dimension, kL projections
• Search the buckets for collisions (T_c)
• O(D·L·N_c); D - dimension, L projections, and
• where N_c = Σ_{q′ in the dataset} p^k(‖q − q′‖); N_c - expected number of collisions for a single projection
• Analyze
• T_g increases as k & L increase
• T_c decreases as k increases, since p^k < p
15. How many projections(L)?
• For query point p & neighbor q,
• For a single projection,
• success probability of collision: ≥ P1^k
• For L projections,
• failure probability of collision: ≤ (1 − P1^k)^L
∴ setting (1 − P1^k)^L = δ gives
L = log δ / log(1 − P1^k)
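The closed form for L can be turned into a small helper (the numbers below are illustrative; the slides fix no particular P1, k, or δ):

```python
import math

def tables_needed(p1, k, delta):
    """Smallest L such that the failure probability (1 - p1**k)**L <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

# e.g. with P1 = 0.9, k = 10 dot products, and target failure delta = 0.01:
L = tables_needed(0.9, 10, 0.01)
print(L)  # → 11
```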
17. REFERENCES
[1] Anand Rajaraman and Jeff Ullman, “Chapter Three of ‘Mining of
Massive Datasets,’” pp. 72–130.
[2] M. Slaney and M. Casey, “Lecture Note: LSH,” 2008.
[3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk,
S. Madden, and P. Dubey, “Streaming similarity search over one billion
tweets using parallel locality-sensitive hashing,” Proc. VLDB Endow., vol.
6, no. 14, pp. 1930–1941, Sep. 2013.
Editor's notes
A randomized algorithm does not guarantee an exact answer but instead provides a high probability guarantee that it will return the correct answer or one close to it
O(log N); N - number of objects; when d is one-dimensional this is binary search, but the guarantee no longer holds when d becomes high
K-d tree algorithm - the problem with multidimensional algorithms such as k-d trees is that they break down when the dimensionality of the search space is greater than a few dimensions, degenerating to O(N)
Grid: Close points should be in the same grid cell. But some can always lie across the boundary (no matter how close). Some may be further than 1 grid cell, but still close. And in high dimensions, the number of neighboring grid cells grows exponentially. One option is to randomly shift (and rotate) and try again
Hash – O(1) search, while O(N) memory
Notice that we say nothing about what happens when the distance between the items is strictly between d1 and d2, but we can make d1 and d2 as close as we wish. The penalty is that typically p1 and p2 are then close as well. As we shall see, it is possible to drive p1 and p2 apart while keeping d1 and d2 fixed - according to a Chernoff-Hoeffding bound
the probability that p and q collide under a random choice of hash function depends only on the distance between p and q
In fact, if the angle θ between the randomly chosen line and the line connecting the points is large, then there is an even greater chance that the two points will fall in the same bucket. For instance, if θ is 90 degrees, then the two points are certain to fall in the same bucket.
However, suppose d is larger than a. In order for there to be any chance of the two points falling in the same bucket, we need d cos θ ≤ a
Finding a good hash implementation, and analyzing the hash performance
Increasing the quantization bucket width w will increase the number of points that fall into each bucket. To obtain our final nearest neighbor result we will have to perform a linear search through all the points that fall into the same bucket as the query, so varying w effects a trade-off between a larger table with a smaller final linear search, or a more compact table with more points to consider in the final search