2. Problem Statement
• Given a query point q,
• find the closest items to the query point with probability 1 − δ
• Iterative methods?
• Large volume of data
• Curse of dimensionality
3. Taxonomy – Near Neighbor Query (NN)
NN
• Trees: k-d tree, range tree, B-tree, cover tree
• Grid
• Voronoi diagram
• Hash
• Approximate: LSH
4. Approximate LSH
• Simple idea: if two points are close together, then after a “projection” operation these two points will remain close together
5. LSH Requirement
• For any given points p, q ∈ R^d:
• P_H[h(p) = h(q)] ≥ P1 for ‖p − q‖ ≤ d1
• P_H[h(p) = h(q)] ≤ P2 for ‖p − q‖ ≥ c·d1 = d2
• A hash function h satisfying this is (d1, d2, P1, P2)-sensitive; ideally we need
• (P1 − P2) to be large
• (d2 − d1) to be small
8. Hash Function (Random)
• Locality-preserving
• Independent
• Deterministic
• Family of Hash Function per various distance measures
• Euclidean
• Jaccard
• Cosine Similarity
• Hamming
9. LSH Family for Euclidean distance (2d)
• Project points onto a random line cut into buckets of width a; two points at distance d can share a bucket only when d·cos θ ≤ a (θ: angle between the random line and the line joining the points)
• Chance of colliding
• But not certain
• But we can guarantee:
• If d ≤ a/2,
• then d·cos θ ≤ a/2 for every θ, so the chance of sharing a bucket is at least 1/2
• ∴ P1 ≥ 1/2
• If d ≥ 2a,
• collision requires cos θ ≤ 1/2, i.e., 60° ≤ θ ≤ 90°, which happens with probability at most 1/3
• ∴ P2 ≤ 1/3
• As LSH: (d1, d2, P1, P2) = (a/2, 2a, 1/2, 1/3)-sensitive
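The (a/2, 2a, 1/2, 1/3) bounds above can be checked empirically. A minimal Monte Carlo sketch (not from the slides; function names and parameters are illustrative), projecting 2-d points onto a random line with bucket width a and a random offset:

```python
import math
import random

def same_bucket(p, q, a, rng):
    """Project p and q onto a random line through the origin and compare
    buckets of width a, shifted by a random offset b in [0, a)."""
    theta = rng.uniform(0, math.pi)            # random line direction
    ux, uy = math.cos(theta), math.sin(theta)
    b = rng.uniform(0, a)                      # random bucket offset
    s = math.floor((p[0] * ux + p[1] * uy + b) / a)
    t = math.floor((q[0] * ux + q[1] * uy + b) / a)
    return s == t

def collision_rate(dist, a, trials=20000, seed=0):
    """Estimate the collision probability for point pairs at distance `dist`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        phi = rng.uniform(0, 2 * math.pi)      # random orientation of the pair
        p = (0.0, 0.0)
        q = (dist * math.cos(phi), dist * math.sin(phi))
        if same_bucket(p, q, a, rng):
            hits += 1
    return hits / trials

a = 1.0
p1 = collision_rate(a / 2, a)   # points at distance a/2
p2 = collision_rate(2 * a, a)   # points at distance 2a
print(p1, p2)                   # expect p1 >= 1/2 and p2 <= 1/3
```

The estimates comfortably satisfy the guaranteed bounds because the bounds are deliberately loose.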
10. How to define the projection?
• Scalar projection (dot product):
h(v) = v · x
v - query point in d-dimensional space
x - vector with random components drawn from N(0, 1)
• Quantized projection:
h(v) = ⌊(v · x + b) / w⌋
w - width of the quantization bin
b - random variable uniformly distributed between 0 and w
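The quantized projection above can be sketched in a few lines of Python (a hypothetical helper, assuming Gaussian components for x as the slide states):

```python
import math
import random

def make_hash(d, w, seed=None):
    """Build one LSH function h(v) = floor((v . x + b) / w), with x having
    i.i.d. N(0, 1) components and b drawn uniformly from [0, w)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)

    def h(v):
        return math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / w)

    return h

h = make_hash(d=3, w=4.0, seed=42)
print(h((1.0, 2.0, 3.0)), h((1.1, 2.1, 2.9)))  # nearby points often share a bin
```

With the seed fixed, the hash is deterministic, matching the "Deterministic" requirement on slide 8.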
11. How to define the projection?
• Use k dot products, so that (P1/P2)^k > (P1/P2):
• fewer points at different separations will fall into the same quantization bin
• Perform k independent dot products
• Achieve success,
• if the query and the nearest neighbor are in the same bin in all k dot products
• Success probability = P1^k; decreases as we include more dot products
12. Multiple-projections
• L independent projections
• The true near neighbor is unlikely to be unlucky in all L projections
• By increasing L,
• we can find the true nearest neighbor with arbitrarily high probability
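Slides 11 and 12 together describe a table keyed by k dot products, repeated L times. A minimal sketch of such an index (class and method names are illustrative, not from the slides):

```python
import math
import random

class LSHIndex:
    """Sketch of an LSH index: L tables, each keyed by k quantized projections."""

    def __init__(self, d, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.funcs = []   # per table: k pairs of (random vector x, offset b)
        self.tables = []  # per table: dict mapping k-tuple key -> list of points
        for _ in range(L):
            fs = [([rng.gauss(0, 1) for _ in range(d)], rng.uniform(0, w))
                  for _ in range(k)]
            self.funcs.append(fs)
            self.tables.append({})

    def _key(self, fs, v):
        # k-tuple of quantized dot products, floor((v . x + b) / w)
        return tuple(math.floor((sum(a * b for a, b in zip(v, x)) + b0) / self.w)
                     for x, b0 in fs)

    def add(self, v):
        for fs, table in zip(self.funcs, self.tables):
            table.setdefault(self._key(fs, v), []).append(v)

    def query(self, q):
        # union of candidates colliding with q in at least one of the L tables
        cands = set()
        for fs, table in zip(self.funcs, self.tables):
            cands.update(table.get(self._key(fs, q), []))
        return cands

idx = LSHIndex(d=2, k=4, L=8, w=4.0, seed=0)
idx.add((0.1, 0.1))
idx.add((10.0, 10.0))
print(idx.query((0.0, 0.0)))  # nearby point should appear with high probability
```

A candidate set returned by `query` would still be scanned linearly to pick the true nearest neighbor, which is the T_c term on slide 14.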
13. Accuracy
• Two close points p and q,
• separated by u = ‖p − q‖
• Probability of collision P_H(u):
P_H(u) = Pr_H[H(p) = H(q)] = ∫₀^w (1/u) · f_s(t/u) · (1 − t/w) dt
f_s - probability density function of the random projection
• As the distance u increases, P_H(u) decreases
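The integral above can be evaluated numerically to confirm that P_H(u) falls as u grows. A sketch assuming Gaussian projections, so that f_s is taken as the density of |N(0, 1)| (an assumption; the slides only name f_s as the projection's density):

```python
import math

def f_abs_gauss(x):
    """Density of |N(0,1)| (assumes Gaussian random projections)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-x * x / 2.0)

def collision_prob(u, w, steps=10000):
    """P_H(u) = integral_0^w (1/u) f_s(t/u) (1 - t/w) dt, by the trapezoid rule."""
    h = w / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        val = (1.0 / u) * f_abs_gauss(t / u) * (1.0 - t / w)
        total += val * (h if 0 < i < steps else h / 2.0)
    return total

w = 4.0
ps = [collision_prob(u, w) for u in (0.5, 1.0, 2.0, 4.0)]
print(ps)  # strictly decreasing in u
```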
14. Time complexity
• For a query point q,
• time to find the near neighbor: T_g + T_c
• Calculate & hash the projections (T_g)
• O(DkL); D - dimension, kL projections
• Search the buckets for collisions (T_c)
• O(D·L·N_c); D - dimension, L projections, and
• where N_c = Σ_{q′ in the dataset} p^k(‖q − q′‖); N_c - expected number of collisions for a single projection
• Analyze
• T_g increases as k & L increase
• T_c decreases as k increases, since p^k < p
15. How many projections(L)?
• For query point p & neighbor q,
• For a single projection,
• success probability of collision: ≥ P1^k
• For L projections,
• failure probability of collision: ≤ (1 − P1^k)^L
∴ setting (1 − P1^k)^L = δ gives
L = log δ / log(1 − P1^k)
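The closed form for L can be turned into a small helper (the numbers below are illustrative; the slides fix no particular P1, k, or δ):

```python
import math

def tables_needed(p1, k, delta):
    """Smallest L such that the failure probability (1 - p1**k)**L <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

# e.g. with P1 = 0.9, k = 10 dot products, and target failure delta = 0.01:
L = tables_needed(0.9, 10, 0.01)
print(L)  # → 11
```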
17. REFERENCES
[1] Anand Rajaraman and Jeff Ullman, “Chapter Three of ‘Mining of
Massive Datasets,’” pp. 72–130.
[2] M. Slaney and M. Casey, “Lecture Note: LSH,” 2008.
[3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk,
S. Madden, and P. Dubey, “Streaming similarity search over one billion
tweets using parallel locality-sensitive hashing,” Proc. VLDB Endow., vol.
6, no. 14, pp. 1930–1941, Sep. 2013.
Editor's notes
A randomized algorithm does not guarantee an exact answer but instead provides a high probability guarantee that it will return the correct answer or one close to it
O(log N); N - number of objects; when d is one-dimensional this is binary search, but the guarantee no longer holds when d becomes high
K-d tree algorithm - the problem with multidimensional algorithms such as k-d trees is that they break down when the dimensionality of the search space is greater than a few dimensions, degenerating to O(N)
Grid: Close points should be in the same grid cell. But some can always lie across the boundary (no matter how close). Some may be further than 1 grid cell, but still close. And in high dimensions, the number of neighboring grid cells grows exponentially. One option is to randomly shift (and rotate) and try again
Hash – O(1) search, while O(N) memory
Notice that we say nothing about what happens when the distance between the items is strictly between d1 and d2, but we can make d1 and d2 as close as we wish. The penalty is that typically p1 and p2 are then close as well. As we shall see, it is possible to drive p1 and p2 apart while keeping d1 and d2 fixed - according to a Chernoff-Hoeffding bound
the probability that p and q collide under a random choice of hash function depends only on the distance between p and q
In fact, if the angle θ between the randomly chosen line and the line connecting the points is large, then there is an even greater chance that the two points will fall in the same bucket. For instance, if θ is 90 degrees, then the two points are certain to fall in the same bucket.
However, suppose d is larger than a. In order for there to be any chance of the two points falling in the same bucket, we need d cos θ ≤ a
Finding a good hash implementation, and analyzing the hash performance
Increasing the quantization bucket width w will increase the number of points that fall into each bucket. To obtain our final nearest neighbor result we will have to perform a linear search through all the points that fall into the same bucket as the query, so varying w effects a trade-off between a larger table with a smaller final linear search, or a more compact table with more points to consider in the final search