3.1.2 MinHash for similarities
The adjacency matrix is helpful, but it becomes hard to apply when the data is large, because both its
size and the cost of computing it grow quickly. Estimating the similarity of all pairs takes O(n^2) time
and space. This is problematic in practice. Take, for example, a commerce site with 12 million products
for which we want to compute and rank similarity scores. The full matrix has 12 million x 12 million =
144 x 10^12 entries. If each entry is a 64-bit (8-byte) float, we need 1.152 x 10^15 bytes, roughly 1.15
petabytes, to hold the adjacency matrix in memory. At this scale the data is difficult to use at all, and
things get worse with more extensive datasets such as social network or web data. Besides, the data is
highly likely to have many features (columns). It is therefore exceedingly challenging to store this data
in memory and perform similarity checks on it. We cannot check all the pairs, so we have to find an
alternative technique to locate groups of highly similar pairs. MinHash allows us to compress all these
features into a lower-dimensional space that still preserves the similarities present in the
high-dimensional data [20,9].
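
The storage estimate for the 12-million-product example can be checked with a few lines of arithmetic. The following is only a sketch; the variable names are illustrative, and the count of unordered pairs is added for comparison and does not appear in the text above.

```python
# Back-of-the-envelope cost of a dense similarity matrix for n products.
n_products = 12_000_000
bytes_per_score = 8  # one 64-bit float per matrix entry

n_entries = n_products ** 2                 # full n x n matrix
total_bytes = n_entries * bytes_per_score   # memory for the dense matrix

# Assumption for comparison only: storing each unordered pair once
# (upper triangle without the diagonal) roughly halves the count.
unique_pairs = n_products * (n_products - 1) // 2

print(f"entries in full matrix: {n_entries:.3e}")
print(f"bytes for full matrix:  {total_bytes:.3e}")
print(f"unordered pairs:        {unique_pairs:.3e}")
```

Even when each pair is stored only once, the requirement stays in the hundreds of terabytes, so halving the matrix does not remove the need for a compressed representation.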
The basic idea is that the compressed feature space preserves the similarity between any two objects.
Each object is represented by a small signature that is much shorter than its full feature vector, and
the similarity between two signatures is equal, or very close, to the similarity of the objects in the
full feature space. Because each object is treated as a set, we use Jaccard similarity to find similar
sets. MinHash lets us evaluate similarity in a low-dimensional space, and locality-sensitive hashing
(LSH) lets us deal with the all-pairs problem: we evaluate similarity only for a set of candidate pairs.
A pair matters only if its similarity exceeds a threshold, which lets us skip most of the pair checks.
While working with the small signatures, we do not have to keep the full feature vectors in memory,
because the similarity of a pair is approximately equal to the similarity of its signatures. The final
step is to take the pairs with similar signatures and measure their similarity on the full feature
vectors. The key idea is to hash each element with a hash function.
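
The final verification step can be sketched as follows: given candidate pairs (however they were produced), compute the exact Jaccard similarity on the full sets and keep only the pairs above a threshold. The function names, the product data, and the threshold of 0.5 are hypothetical and chosen only for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def verify_candidates(sets: dict, candidate_pairs: list, threshold: float) -> list:
    """Keep only the candidate pairs whose exact similarity reaches the threshold."""
    confirmed = []
    for i, j in candidate_pairs:
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            confirmed.append((i, j, score))
    return confirmed


# Hypothetical products, each described by a set of features.
sets = {
    "p1": {"red", "cotton", "shirt", "summer"},
    "p2": {"red", "cotton", "shirt", "winter"},
    "p3": {"steel", "kitchen", "knife"},
}
# Assume an earlier filtering stage proposed these two pairs.
candidates = [("p1", "p2"), ("p1", "p3")]
confirmed = verify_candidates(sets, candidates, threshold=0.5)
print(confirmed)
```

Only the candidate pairs are compared on the full sets, so the cost of this step depends on the number of candidates, not on the number of all possible pairs.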
Hashing converts an input of any length into a fixed-size string using a mathematical function. Any
text can be converted into a fixed-length sequence of numbers and letters by the algorithm. The message
to be hashed is the input, the algorithm is called the hash function, and the output is called the hash
value. Hash values should be effectively unique: it should be infeasible in practice to find two
different inputs that produce the same hash value. The same message should always produce the same
hash value. Speed is also an essential factor: the hash function should compute hash values quickly.
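
These properties can be observed with a standard hash function. The sketch below uses SHA-256 from Python's hashlib module; the choice of SHA-256 is an assumption made for illustration, since the text does not name a specific hash function.

```python
import hashlib


def sha256_hex(message: str) -> str:
    """Hash a text message to a fixed-size hexadecimal string."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


short_digest = sha256_hex("a")
long_digest = sha256_hex("a" * 10_000)

# Fixed-size output: both digests have the same length despite the input sizes.
print(len(short_digest), len(long_digest))

# Deterministic: hashing the same message again gives the same value.
print(sha256_hex("a") == short_digest)

# Different inputs are expected to give different values.
print(short_digest != long_digest)
```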
For MinHash, the hash values have to be small enough that the signatures fit in memory, and the
similarity Sim(C1,C2) of two objects C1 and C2 should be reflected by their hash values h(C1) and
h(C2): if Sim(C1,C2) is high, then the probability that h(C1) = h(C2) is high, and if Sim(C1,C2) is low,
that probability is low. Note that not every similarity measure has a suitable hash function. For
Jaccard similarity, the suitable hash function is MinHash. The similarity of two signatures is the
fraction of hash functions on which they agree. In this way, MinHash compresses long vectors into
short signatures [20,9,21].
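
A minimal sketch of this idea is given below. Each of the k hash functions is simulated by prefixing the element with a seed and applying SHA-256, which is an assumption made here for reproducibility and not necessarily the construction used in [20,9,21]. The names seeded_hash, minhash_signature and signature_similarity, the value k = 256, and the example sets are all hypothetical.

```python
import hashlib


def seeded_hash(seed: int, element: str) -> int:
    """One member of a family of hash functions, selected by the seed."""
    digest = hashlib.sha256(f"{seed}:{element}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # keep 64 bits


def minhash_signature(items: set, num_hashes: int = 256) -> list:
    """For each hash function, keep the minimum hash value over the set."""
    if not items:
        raise ValueError("MinHash is undefined for an empty set")
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def signature_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of hash functions on which the two signatures agree."""
    agree = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return agree / len(sig_a)


# Two hypothetical sets with exact Jaccard similarity 6 / 10 = 0.6.
c1 = {"a", "b", "c", "d", "e", "f", "g", "h"}
c2 = {"a", "b", "c", "d", "e", "f", "x", "y"}

sig1 = minhash_signature(c1)
sig2 = minhash_signature(c2)
estimate = signature_similarity(sig1, sig2)
print(f"estimated similarity: {estimate:.2f} (exact Jaccard is 0.60)")
```

The estimate is a random quantity whose error shrinks as the number of hash functions grows, so the signature length trades memory against accuracy.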
