MinHash is an algorithm that allows estimating the similarity of large datasets in sub-quadratic time. It works by compressing high-dimensional feature vectors into smaller "signatures" such that the Jaccard similarity between any two vectors is approximately equal to the similarity between their signatures. Locality-sensitive hashing then evaluates similarity only for candidate pairs that may exceed a threshold, avoiding a check of all pairs. The MinHash signatures are generated by applying multiple hash functions to the feature vectors, such that similar vectors are likely to hash to the same values. This allows finding similar objects without storing the entire feature space.
3.1.2 MinHash for similarities
The adjacency matrix is helpful, but it is difficult to apply when the data is large: the matrices are huge and increase the complexity. Estimating the similarity of all pairs is O(n^2). This is problematic if, for example, we want to use a commerce site with 12 million products and identify and rank products by a similarity score. With 12 million elements there are 144 × 10^12 pairs; if each pair holds a 64-bit float, we need 1.152 × 10^15 bytes to store the adjacency matrix in memory. At this scale the data becomes very difficult to work with, and things get rougher when we move to an even larger dataset, such as a social network or web dataset. Besides, the data is highly likely to have many features (columns). So it is exceedingly challenging to store this data in memory and perform similarity checks; we have to find an alternative technique that locates groups of highly similar pairs without checking all the pairs. MinHash allows us to compress all these features into a lower-dimensional space that works well and preserves the similarities of the high-dimensional data [20,9].
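To make the scale concrete, a quick back-of-the-envelope check of the figures above (a minimal Python sketch; the product count and the 8 bytes per 64-bit float come from the example):

```python
# Cost of a full pairwise similarity matrix for the 12-million-product example.
n = 12_000_000                     # number of products
pairs = n * n                      # all entries of the n x n adjacency matrix
bytes_needed = pairs * 8           # one 64-bit float (8 bytes) per pair

print(f"pairs:  {pairs:.3e}")           # ~1.440e+14
print(f"memory: {bytes_needed:.3e} B")  # ~1.152e+15 B, i.e. more than a petabyte
```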
The basic idea is that the compressed feature space maintains the similarities between any two objects. The signatures are much smaller than the full feature vectors, yet the similarity between two signatures is equal, or very close, to the similarity in the full feature space. Because the objects are represented as sets, we can then use Jaccard similarity to find similar sets. MinHash lets us evaluate similarity in a low-dimensional space, and locality-sensitive hashing (LSH) lets us deal with the pair problem: we only evaluate similarity for a set of candidate pairs, and pairs only matter if they exceed a threshold, which lets us skip most of the pair checking. While computing the small signatures, we do not have to store the full feature vectors. The similarity of two objects is approximately equal to the similarity of their signatures, and the final step is to check the pairs with similar signatures by measuring their similarity on the full feature vectors. The key idea is to hash each element with a hash function, as in the sketch below.
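A minimal sketch of this idea in Python. It assumes each object is a set of hashable tokens and uses salted built-in hashes as the family of hash functions; the token sets, the number of hash functions, and the function names are illustrative assumptions, not the implementation described in the referenced works:

```python
import random

def minhash_signature(items, num_hashes=128, seed=42):
    """Compress a set of tokens into a MinHash signature: for each of
    num_hashes salted hash functions, keep the minimum hash value seen
    over the whole set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, item)) for item in items) for salt in salts]

def signature_similarity(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures agree;
    this estimates the Jaccard similarity of the original sets."""
    agree = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return agree / len(sig_a)

# Usage: two product feature sets that overlap heavily.
a = {"red", "shirt", "cotton", "slim", "crew-neck"}
b = {"red", "shirt", "cotton", "slim", "v-neck"}
sig_a = minhash_signature(a)
sig_b = minhash_signature(b)
print(signature_similarity(sig_a, sig_b))  # close to |a & b| / |a | b| = 4/6
```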
Hashing converts an input of any length into a fixed-size value using a mathematical function. Any text can be converted into an array of numbers and letters through the algorithm: the message is the input, the algorithm is called the hash function, and the output is called the hash value. Hash values should be unique; it should be practically impossible for two different inputs to produce the same hash value, while the same message should always produce the same hash value. Hash speed is also an essential factor: the hash function should produce hash values quickly.
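As a small illustration of these properties, using Python's hashlib (SHA-256 is only a familiar example of a hash function here; MinHash implementations typically use cheaper non-cryptographic hashes):

```python
import hashlib

# The same input always yields the same fixed-size hash value,
# while a slightly different input yields a completely different one.
print(hashlib.sha256(b"red cotton shirt").hexdigest())
print(hashlib.sha256(b"red cotton shirt").hexdigest())   # identical to the line above
print(hashlib.sha256(b"red cotton shirts").hexdigest())  # unrelated-looking digest
```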
The hash values have to be small enough that the signatures fit in memory, and Sim(C1, C2) should be reflected by h(C1) and h(C2): if Sim(C1, C2) is high, then the probability that h(C1) = h(C2) is high. We have to keep in mind that not every similarity measure has a suitable hash function; Jaccard similarity, for example, is suitable for MinHash. The similarity of two signatures is the fraction of the hash functions on which they agree. Finally, with MinHash, we have compressed long feature vectors into short signatures [20,9,21].
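The candidate-pair step mentioned earlier (locality-sensitive hashing over the signatures) can be sketched as follows. This is a minimal banding scheme, assuming 128-value signatures such as the ones produced above; the band and row counts are illustrative assumptions, not prescribed values:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands=16, rows=8):
    """Split each MinHash signature into `bands` bands of `rows` values.
    Two items become a candidate pair if they share any band verbatim,
    so only those pairs (not all O(n^2) pairs) go on to a full check."""
    buckets = defaultdict(list)
    for item_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(item_id)
    candidates = set()
    for ids in buckets.values():
        for pair in combinations(sorted(ids), 2):
            candidates.add(pair)
    return candidates

# Usage with a (hypothetical) dict {item_id: 128-value signature}:
# candidates = lsh_candidate_pairs(signatures, bands=16, rows=8)
# Only these candidate pairs are then verified against the full feature sets.
```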