SlideShare une entreprise Scribd logo
1  sur  17
Locality Sensitive Hashing
Randomized Algorithm
Problem Statement
• Given a query point q,
• Find closest items to the query
point with the probability of 1 − 𝛿
• Iterative methods?
• Large volume of data
• Curse of dimensionality
Taxonomy – Near Neighbor Query (NN)
NN
Trees
K-d Tree Range Tree B Tree Cover Tree
Grid
Voronoi
Diagram
Hash
Approximate
LSH
Approximate LSH
• Simple Idea
• if two points are close together, then after a “projection” operation these two
points will remain close together
LSH Requirement
• For any given points 𝑝, 𝑞 ∈ 𝑅 𝑑
𝑃 𝐻 ℎ 𝑝 = ℎ 𝑞 ≥ 𝑃1 𝑓𝑜𝑟 𝑝 − 𝑞 ≤ 𝑑1
𝑃 𝐻 ℎ 𝑝 = ℎ 𝑞 ≤ 𝑃2 𝑓𝑜𝑟 𝑝 − 𝑞 ≥ 𝑐𝑑1 = 𝑑2
• Hash function h is (𝑑1, 𝑑2, 𝑃1, 𝑃2) sensitive, Ideally we need
• (𝑃1−𝑃2) to be large
• (𝑑1−𝑑2) to be small
P
d
2d
c.d
q
q
≥ P(1)
≥ P(2)
≥ P(c) P(1) ≥P(2) ≥P(3)
q
Probability vs. Distance on candidate pairs
Hash Function(Random)
• Locality-preserving
• Independent
• Deterministic
• Family of Hash Function per various distance measures
• Euclidean
• Jaccard
• Cosine Similarity
• Hamming
LSH Family for Euclidean distance (2d)
• When d. cos 𝜃 ≤ 𝑎,
• Chance of colliding
• But not certain
• But can guarantee,
• If 𝑑 ≤ 𝑎/2,
• 90 ≥ 𝜃 ≥ 45 to have d. cos 𝜃 ≤ 𝑎
• ∴ 𝑃1 ≥ 1/2
• If 𝑑 ≥ 2𝑎,
• 90 ≥ 𝜃 ≥ 60 to have d. cos 𝜃 ≤ 𝑎
• ∴ 𝑃2 ≤ 1/3
• As LSH (𝑑1, 𝑑2, 𝑃1, 𝑃2) sensitive
• (𝑎, 2𝑎,
1
2
,
1
3
)
How to define the projection?
• Scalar projection (Dot product)
ℎ
𝑣
=
𝑣
.
𝑥
;
𝑣
= 𝑞𝑢𝑒𝑟𝑦 𝑝𝑜𝑖𝑛𝑡 𝑖𝑛 𝑑 − 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛 𝑠𝑝𝑎𝑐𝑒
𝑥
= 𝑣𝑒𝑐𝑡𝑜𝑟 𝑤𝑖𝑡ℎ 𝑟𝑎𝑛𝑑𝑜𝑚 𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡𝑠 𝑓𝑟𝑜𝑚 𝑁(0,1)
ℎ
𝑣
= 𝑣
.
𝑥
+ 𝑏
𝑤
;
𝑤 − 𝑤𝑖𝑑𝑡ℎ 𝑜𝑓 𝑞𝑢𝑎𝑛𝑡𝑖𝑧𝑎𝑡𝑖𝑜𝑛 𝑏𝑖𝑛
𝑏 − random variable uniformly distributed between 0 and w
How to define the projection?
• K-dot product, that
(
𝑃1
𝑃2
) 𝑘> (
𝑃1
𝑃2
)
points at different separations will fall into the same quantization bin
• Perform k independent dot products
• Achieve success,
• if the query and the nearest neighbor are in the same bin in all k dot products
• Success probability = 𝑃1
𝑘
; decreases as we include more dot products
Multiple-projections
• L independent projections
• True near neighbor will be unlikely to be unlucky in all the projections
• By increasing L,
• we can find the true nearest neighbor with arbitrarily high probability
Accuracy
• Two close points p and q,
• Separated by 𝑢 = 𝑝 − 𝑞
• Probability of collision 𝑃 𝐻 𝑢 ,
𝑃 𝐻 𝑢 = (𝑃 𝐻(𝐻 𝑝 = 𝐻(𝑞))
=
0
𝑤
1
𝑢
. 𝑓𝑠
𝑡
𝑢
. 1 −
𝑡
𝑤
𝑑𝑡
𝑓𝑠- probability density function of H
• As distance u increases, 𝑃 𝐻 𝑢 decreases
Time complexity
• For a query point q,
• To Find the near neighbor: (𝑇𝑔+𝑇𝑐)
• Calculate & hash the projections (𝑇𝑔)
• O(DkL); D−dimension, kL projections
• Search the bucket for collisions (𝑇𝑐)
• O(DL𝑁𝑐); D-dimension, L projections, and
• where 𝑁𝑐 = 𝑞′∈𝐷 𝑝 𝑘
. | 𝑞 − 𝑞′
|; 𝑁𝑐 - expected number of collisions for single projection
• Analyze
• 𝑇𝑔 increases as k & L increase
• 𝑇𝑐 decreases as k increases since 𝑝 𝑘 < 𝑝
How many projections(L)?
• For query point p & neighbor q,
• For single projection,
• Success probability of collisions: ≥ 𝑃1
𝑘
• For L projections,
• Failure probability of collisions: ≤ (1 − 𝑃1
𝑘
) 𝐿
∴ (1 − 𝑃1
𝑘
) 𝐿= 𝛿
𝐿 =
log 𝛿
log(1 − 𝑃1
𝑘
)
LSH in MAXDIVREL Diversity
#1 #2 #3 … #k dot
product
1 1 0 0 .. 1
2 0 1 1 … 1
w 0 0 1 … 0
#1 #2 #3 … #k dot
product
1 1 1 0 .. 1
2 1 0 1 … 1
w 0 1 1 … 0
#1 #2 #3 … #k dot
product
1 1 0 1 .. 0
2 0 0 1 … 0
w 0 1 0 … 0
#1 #2 #3 … #k dot
product
1 1 0 0 .. 1
2 0 1 1 … 1
w 0 0 1 … 0
REFERENCES
[1] Anand Rajaraman and Jeff Ullman, “Chapter Three of ‘Mining of
Massive Datasets,’” pp. 72–130.
[2] M. Slaney and M. Casey, “Lecture Note: LSH,” 2008.
[3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk,
S. Madden, and P. Dubey, “Streaming similarity search over one billion
tweets using parallel locality-sensitive hashing,” Proc. VLDB Endow., vol.
6, no. 14, pp. 1930–1941, Sep. 2013.

Contenu connexe

Tendances

Adversarial search
Adversarial searchAdversarial search
Adversarial searchDheerendra k
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsSkillspeed
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based ClusteringSSA KPI
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)Cory Cook
 
RNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataRNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataYao-Chieh Hu
 
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VECUnit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VECsundarKanagaraj1
 
K means and dbscan
K means and dbscanK means and dbscan
K means and dbscanYan Xu
 
Rabin Carp String Matching algorithm
Rabin Carp String Matching  algorithmRabin Carp String Matching  algorithm
Rabin Carp String Matching algorithmsabiya sabiya
 
Simulated Annealing
Simulated AnnealingSimulated Annealing
Simulated Annealingkellison00
 
Neural Networks: Support Vector machines
Neural Networks: Support Vector machinesNeural Networks: Support Vector machines
Neural Networks: Support Vector machinesMostafa G. M. Mostafa
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Greedy algorithms
Greedy algorithmsGreedy algorithms
Greedy algorithmsRajendran
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)Pravinkumar Landge
 
Adversarial search
Adversarial searchAdversarial search
Adversarial searchNilu Desai
 

Tendances (20)

Adversarial search
Adversarial searchAdversarial search
Adversarial search
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)
 
Greedy algorithm
Greedy algorithmGreedy algorithm
Greedy algorithm
 
RNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataRNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential Data
 
Greedy algorithms
Greedy algorithmsGreedy algorithms
Greedy algorithms
 
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VECUnit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
Unit IV UNCERTAINITY AND STATISTICAL REASONING in AI K.Sundar,AP/CSE,VEC
 
K means and dbscan
K means and dbscanK means and dbscan
K means and dbscan
 
Kmp
KmpKmp
Kmp
 
Rabin Carp String Matching algorithm
Rabin Carp String Matching  algorithmRabin Carp String Matching  algorithm
Rabin Carp String Matching algorithm
 
Simulated Annealing
Simulated AnnealingSimulated Annealing
Simulated Annealing
 
Neural Networks: Support Vector machines
Neural Networks: Support Vector machinesNeural Networks: Support Vector machines
Neural Networks: Support Vector machines
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Greedy algorithms
Greedy algorithmsGreedy algorithms
Greedy algorithms
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)
 
Uncertainty in AI
Uncertainty in AIUncertainty in AI
Uncertainty in AI
 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 

Similaire à Locality sensitive hashing

Sketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignmentSketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignmentssuser2be88c
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford MapR Technologies
 
Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satChenYiHuang5
 
Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)Hwa Pyung Kim
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
Data Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxData Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxSubrata Kumer Paul
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Ted Dunning
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersAlbert Y. C. Chen
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSLiemNguyenDuy
 
cnn.pptx
cnn.pptxcnn.pptx
cnn.pptxsghorai
 
A compact zero knowledge proof to restrict message space in homomorphic encry...
A compact zero knowledge proof to restrict message space in homomorphic encry...A compact zero knowledge proof to restrict message space in homomorphic encry...
A compact zero knowledge proof to restrict message space in homomorphic encry...MITSUNARI Shigeo
 
A short introduction to Quantum Computing and Quantum Cryptography
A short introduction to Quantum Computing and Quantum CryptographyA short introduction to Quantum Computing and Quantum Cryptography
A short introduction to Quantum Computing and Quantum CryptographyFacultad de Informática UCM
 
Bounded arithmetic in free logic
Bounded arithmetic in free logicBounded arithmetic in free logic
Bounded arithmetic in free logicYamagata Yoriyuki
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23Aritra Sarkar
 
Deep learning study 2
Deep learning study 2Deep learning study 2
Deep learning study 2San Kim
 

Similaire à Locality sensitive hashing (20)

Sketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignmentSketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignment
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit sat
 
Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Data Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxData Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptx
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
SVD.ppt
SVD.pptSVD.ppt
SVD.ppt
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNS
 
cnn.pptx
cnn.pptxcnn.pptx
cnn.pptx
 
Paris Data Geeks
Paris Data GeeksParis Data Geeks
Paris Data Geeks
 
A compact zero knowledge proof to restrict message space in homomorphic encry...
A compact zero knowledge proof to restrict message space in homomorphic encry...A compact zero knowledge proof to restrict message space in homomorphic encry...
A compact zero knowledge proof to restrict message space in homomorphic encry...
 
A short introduction to Quantum Computing and Quantum Cryptography
A short introduction to Quantum Computing and Quantum CryptographyA short introduction to Quantum Computing and Quantum Cryptography
A short introduction to Quantum Computing and Quantum Cryptography
 
Bounded arithmetic in free logic
Bounded arithmetic in free logicBounded arithmetic in free logic
Bounded arithmetic in free logic
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
 
Data Mining Lecture_9.pptx
Data Mining Lecture_9.pptxData Mining Lecture_9.pptx
Data Mining Lecture_9.pptx
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
 
Deep learning study 2
Deep learning study 2Deep learning study 2
Deep learning study 2
 

Plus de Sameera Horawalavithana

Data-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and SimulationData-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and SimulationSameera Horawalavithana
 
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
 Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
Drivers of Polarized Discussions on Twitter during Venezuela Political CrisisSameera Horawalavithana
 
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
 Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
Twitter Is the Megaphone of Cross-platform Messaging on the White HelmetsSameera Horawalavithana
 
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Sameera Horawalavithana
 
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHubMentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHubSameera Horawalavithana
 
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...Sameera Horawalavithana
 
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...Sameera Horawalavithana
 
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation [ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation Sameera Horawalavithana
 
Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015Sameera Horawalavithana
 
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...Sameera Horawalavithana
 
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...Sameera Horawalavithana
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingSameera Horawalavithana
 

Plus de Sameera Horawalavithana (17)

Data-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and SimulationData-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and Simulation
 
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
 Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
 
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
 Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
 
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
 
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHubMentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
 
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
 
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy Dataset
 
Dancing with Stream Processing
Dancing with Stream ProcessingDancing with Stream Processing
Dancing with Stream Processing
 
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation [ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
 
Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015
 
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
 
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
 
Zipf distribution
Zipf distributionZipf distribution
Zipf distribution
 
Query personalization
Query personalizationQuery personalization
Query personalization
 
Dancing with publish/subscribe
Dancing with publish/subscribeDancing with publish/subscribe
Dancing with publish/subscribe
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
 

Dernier

Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 

Dernier (20)

Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 

Locality sensitive hashing

  • 2. Problem Statement • Given a query point q, • Find closest items to the query point with the probability of 1 − 𝛿 • Iterative methods? • Large volume of data • Curse of dimensionality
  • 3. Taxonomy – Near Neighbor Query (NN) NN Trees K-d Tree Range Tree B Tree Cover Tree Grid Voronoi Diagram Hash Approximate LSH
  • 4. Approximate LSH • Simple Idea • if two points are close together, then after a “projection” operation these two points will remain close together
  • 5. LSH Requirement • For any given points 𝑝, 𝑞 ∈ 𝑅 𝑑 𝑃 𝐻 ℎ 𝑝 = ℎ 𝑞 ≥ 𝑃1 𝑓𝑜𝑟 𝑝 − 𝑞 ≤ 𝑑1 𝑃 𝐻 ℎ 𝑝 = ℎ 𝑞 ≤ 𝑃2 𝑓𝑜𝑟 𝑝 − 𝑞 ≥ 𝑐𝑑1 = 𝑑2 • Hash function h is (𝑑1, 𝑑2, 𝑃1, 𝑃2) sensitive, Ideally we need • (𝑃1−𝑃2) to be large • (𝑑1−𝑑2) to be small
  • 6. P d 2d c.d q q ≥ P(1) ≥ P(2) ≥ P(c) P(1) ≥P(2) ≥P(3) q
  • 7. Probability vs. Distance on candidate pairs
  • 8. Hash Function(Random) • Locality-preserving • Independent • Deterministic • Family of Hash Function per various distance measures • Euclidean • Jaccard • Cosine Similarity • Hamming
  • 9. LSH Family for Euclidean distance (2d) • When d. cos 𝜃 ≤ 𝑎, • Chance of colliding • But not certain • But can guarantee, • If 𝑑 ≤ 𝑎/2, • 90 ≥ 𝜃 ≥ 45 to have d. cos 𝜃 ≤ 𝑎 • ∴ 𝑃1 ≥ 1/2 • If 𝑑 ≥ 2𝑎, • 90 ≥ 𝜃 ≥ 60 to have d. cos 𝜃 ≤ 𝑎 • ∴ 𝑃2 ≤ 1/3 • As LSH (𝑑1, 𝑑2, 𝑃1, 𝑃2) sensitive • (𝑎, 2𝑎, 1 2 , 1 3 )
  • 10. How to define the projection? • Scalar projection (Dot product) ℎ 𝑣 = 𝑣 . 𝑥 ; 𝑣 = 𝑞𝑢𝑒𝑟𝑦 𝑝𝑜𝑖𝑛𝑡 𝑖𝑛 𝑑 − 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛 𝑠𝑝𝑎𝑐𝑒 𝑥 = 𝑣𝑒𝑐𝑡𝑜𝑟 𝑤𝑖𝑡ℎ 𝑟𝑎𝑛𝑑𝑜𝑚 𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡𝑠 𝑓𝑟𝑜𝑚 𝑁(0,1) ℎ 𝑣 = 𝑣 . 𝑥 + 𝑏 𝑤 ; 𝑤 − 𝑤𝑖𝑑𝑡ℎ 𝑜𝑓 𝑞𝑢𝑎𝑛𝑡𝑖𝑧𝑎𝑡𝑖𝑜𝑛 𝑏𝑖𝑛 𝑏 − random variable uniformly distributed between 0 and w
  • 11. How to define the projection? • K-dot product, that ( 𝑃1 𝑃2 ) 𝑘> ( 𝑃1 𝑃2 ) points at different separations will fall into the same quantization bin • Perform k independent dot products • Achieve success, • if the query and the nearest neighbor are in the same bin in all k dot products • Success probability = 𝑃1 𝑘 ; decreases as we include more dot products
  • 12. Multiple-projections • L independent projections • True near neighbor will be unlikely to be unlucky in all the projections • By increasing L, • we can find the true nearest neighbor with arbitrarily high probability
  • 13. Accuracy • Two close points p and q, • Separated by 𝑢 = 𝑝 − 𝑞 • Probability of collision 𝑃 𝐻 𝑢 , 𝑃 𝐻 𝑢 = (𝑃 𝐻(𝐻 𝑝 = 𝐻(𝑞)) = 0 𝑤 1 𝑢 . 𝑓𝑠 𝑡 𝑢 . 1 − 𝑡 𝑤 𝑑𝑡 𝑓𝑠- probability density function of H • As distance u increases, 𝑃 𝐻 𝑢 decreases
  • 14. Time complexity • For a query point q, • To Find the near neighbor: (𝑇𝑔+𝑇𝑐) • Calculate & hash the projections (𝑇𝑔) • O(DkL); D−dimension, kL projections • Search the bucket for collisions (𝑇𝑐) • O(DL𝑁𝑐); D-dimension, L projections, and • where 𝑁𝑐 = 𝑞′∈𝐷 𝑝 𝑘 . | 𝑞 − 𝑞′ |; 𝑁𝑐 - expected number of collisions for single projection • Analyze • 𝑇𝑔 increases as k & L increase • 𝑇𝑐 decreases as k increases since 𝑝 𝑘 < 𝑝
  • 15. How many projections(L)? • For query point p & neighbor q, • For single projection, • Success probability of collisions: ≥ 𝑃1 𝑘 • For L projections, • Failure probability of collisions: ≤ (1 − 𝑃1 𝑘 ) 𝐿 ∴ (1 − 𝑃1 𝑘 ) 𝐿= 𝛿 𝐿 = log 𝛿 log(1 − 𝑃1 𝑘 )
  • 16. LSH in MAXDIVREL Diversity #1 #2 #3 … #k dot product 1 1 0 0 .. 1 2 0 1 1 … 1 w 0 0 1 … 0 #1 #2 #3 … #k dot product 1 1 1 0 .. 1 2 1 0 1 … 1 w 0 1 1 … 0 #1 #2 #3 … #k dot product 1 1 0 1 .. 0 2 0 0 1 … 0 w 0 1 0 … 0 #1 #2 #3 … #k dot product 1 1 0 0 .. 1 2 0 1 1 … 1 w 0 0 1 … 0
  • 17. REFERENCES [1] Anand Rajaraman and Jeff Ullman, “Chapter Three of ‘Mining of Massive Datasets,’” pp. 72–130. [2] M. Slaney and M. Casey, “Lecture Note: LSH,” 2008. [3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey, “Streaming similarity search over one billion tweets using parallel locality-sensitive hashing,” Proc. VLDB Endow., vol. 6, no. 14, pp. 1930–1941, Sep. 2013.

Notes de l'éditeur

  1. A randomized algorithm does not guarantee an exact answer but instead provides a high proba- bility guarantee that it will return the cor- rect answer or one close to it
  2. O(log N) ; N – number of object; when d is one dimensional this is binary search, but when d becomes high K-d tree algorithm - The problem with multidimensional algorithms such as k-d trees is that they break down when the dimensionality of the search space is greater than a few dimensions O(N) Grid: Close points should be in same grid cell. But some can always lay across the boundary (no matter how close). Some may be further than 1 grid cell, but still close. And in high dimensions, the number of neighboring grid cells grows exponentially. One option is to randomly shift (and rotate) and try again Hash – O(1) search, while O(N) memory
  3. Notice that we say nothing about what happens when the distance between the items is strictly between d1 and d2, but we can make d1 and d2 as close as we wish. The penalty is that typically p1 and p2 are then close as well. As we shall see, it is possible to drive p1 and p2 apart while keeping d1 and d2 fixed - according to a Chernoff-Hoeffding bound
  4. the probability that p and q collide under a random choice of hash function depends only on the distance between p and q
  5. In fact, if the angle θ between the randomly chosen line and the line connecting the points is large, then there is an even greater chance that the two points will fall in the same bucket. For instance, if θ is 90 degrees, then the two points are certain to fall in the same bucket. However, suppose d is larger than a. In order for there to be any chance of the two points falling in the same bucket, we need d cos θ ≤ a
  6. Finding a good hash implementation, and analyzing the hash performance
  7. Increasing the quantization bucket width w will increase the number of points that fall into each bucket. To obtain our final nearest neighbor result we will have to perform a linear search through all the points that fall into the same bucket as the query, so varying w effects a trade-off between a larger table with a smaller final linear search, or a more compact table with more points to consider in the final search