How do you find similar movies, articles or users? Calculating similarity in a database of 2M movies and 40M check-ins in milliseconds, using the MinHashing trick with a full-text search engine.
http://lanyrd.com/2013/rubyslava-july/sckrzm/
6. MinHashing
Key idea:
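What is the probability that two sets have the
same minimum hash value?
P( s(A) = s(B) ) = J(A, B)
where s(X) is the minimum hash value over set X and J(A, B) is the Jaccard similarity of A and B.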
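A short derivation may help explain why the identity holds. It is a sketch under two assumptions that the slide does not state: s(X) denotes the minimum of a hash function h over X, and h orders the elements of A ∪ B like a uniformly random permutation with no collisions. The symbol x* is introduced here only for illustration.

```latex
s(X) = \min_{x \in X} h(x), \qquad
x^{*} = \arg\min_{x \in A \cup B} h(x)

% x* is equally likely to be any element of A \cup B,
% and s(A) = s(B) holds exactly when x* lies in both sets.
P\bigl(s(A) = s(B)\bigr)
  = P\bigl(x^{*} \in A \cap B\bigr)
  = \frac{|A \cap B|}{|A \cup B|}
  = J(A, B)
```

Under these assumptions, the fraction of independent hash functions on which two sets share a minimum is an unbiased estimate of their Jaccard similarity.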
7. MinHashing
# Smallest hash value over all items in the set.
def calculate_minhash(hash_function, set)
  minhash = Float::INFINITY
  set.each do |item|
    value = hash_function.call(item)
    minhash = value if value < minhash
  end
  minhash
end

# One minhash per hash function; together they form the set's signature.
def generate_signature(set)
  @hash_functions.map do |hash_function|
    calculate_minhash(hash_function, set)
  end
end
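To show how signatures can be compared, here is a minimal sketch. It makes several assumptions that are not in the slides: the hash family (`build_hash_functions`, MD5 salted with an index), the `estimate_similarity` helper, and the sample movie lists are all hypothetical, and the hash functions are passed explicitly instead of being read from `@hash_functions` so that the block is self-contained.

```ruby
require 'digest'

# Assumption: a hypothetical hash family, MD5 salted with the function's index.
def build_hash_functions(count)
  (0...count).map do |salt|
    ->(item) { Digest::MD5.hexdigest("#{salt}:#{item}").to_i(16) }
  end
end

# Equivalent to the slide's loop for non-empty sets.
def calculate_minhash(hash_function, set)
  set.map { |item| hash_function.call(item) }.min
end

def generate_signature(hash_functions, set)
  hash_functions.map { |hash_function| calculate_minhash(hash_function, set) }
end

# Fraction of signature positions that agree: an estimate of J(A, B).
def estimate_similarity(signature_a, signature_b)
  matches = signature_a.zip(signature_b).count { |x, y| x == y }
  matches.to_f / signature_a.size
end

hash_functions = build_hash_functions(500)
movies_a = %w[alien aliens predator terminator]
movies_b = %w[alien aliens predator robocop]

signature_a = generate_signature(hash_functions, movies_a)
signature_b = generate_signature(hash_functions, movies_b)

# The exact Jaccard similarity is 3/5 = 0.6; the estimate should land near it.
puts estimate_similarity(signature_a, signature_b)
```

More hash functions give a tighter estimate at the cost of a longer signature per set.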
11. Features
● easily updatable on insertion
○ calculate minhash of new element
○ update existing set minhash if new minhash is lower
● tunable precision at query time!
○ use bigger/smaller part of precalculated signature
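Both features can be sketched in a few lines. The names below (`hash_function`, `add_item`, `estimate_similarity`) and the MD5-based hash family are assumptions made for illustration, not the talk's implementation.

```ruby
require 'digest'

# Assumption: hypothetical hash family, MD5 salted with the slot index.
def hash_function(index)
  ->(item) { Digest::MD5.hexdigest("#{index}:#{item}").to_i(16) }
end

# Insertion: hash only the new item and keep the lower value in each slot.
def add_item(signature, item)
  signature.each_with_index.map do |current, index|
    [current, hash_function(index).call(item)].min
  end
end

# Query-time precision: compare only the first `length` slots.
def estimate_similarity(signature_a, signature_b, length = signature_a.size)
  pairs = signature_a.first(length).zip(signature_b.first(length))
  pairs.count { |x, y| x == y }.to_f / length
end

# An empty set starts with every slot at infinity.
empty = Array.new(200, Float::INFINITY)
user_a = %w[alien aliens predator].reduce(empty) { |sig, item| add_item(sig, item) }
user_b = %w[alien aliens robocop].reduce(empty) { |sig, item| add_item(sig, item) }

puts estimate_similarity(user_a, user_b, 50)   # cheaper, coarser
puts estimate_similarity(user_a, user_b)       # full signature, finer
```

Because each slot only ever takes the minimum, adding an item never requires rehashing the items already in the set.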
12. Extensions
● Weighting
○ not all items in set have same weight
● Locality sensitive hashing
○ shingling
○ boosting high similarity matching
○ e.g. near duplicate detection
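As an illustration of the locality sensitive hashing extension, the sketch below shows character shingling and the banding scheme described in Mining of Massive Datasets. The slides do not say which scheme the talk used, so the helper names (`shingles`, `band_keys`, `candidates?`) and the parameters `k` and `rows` are assumptions.

```ruby
require 'set'

# Shingling: turn a text into the set of its overlapping k-character substrings.
def shingles(text, k = 3)
  normalized = text.downcase.gsub(/\s+/, ' ').strip
  Set.new(normalized.chars.each_cons(k).map(&:join))
end

# Banding: split a signature into bands of `rows` values each.
# Each key pairs the band's position with its contents.
def band_keys(signature, rows)
  signature.each_slice(rows).each_with_index.map { |band, index| [index, band] }
end

# Two signatures become candidates when at least one whole band is identical,
# which favours pairs with high similarity.
def candidates?(signature_a, signature_b, rows)
  (band_keys(signature_a, rows) & band_keys(signature_b, rows)).any?
end

p shingles('abcd').to_a                          # ["abc", "bcd"]
p candidates?([1, 2, 3, 4], [1, 2, 9, 9], 2)     # true: first band matches
p candidates?([1, 2, 3, 4], [1, 9, 3, 9], 2)     # false: no band matches
```

The shingle sets would be minhashed like any other set; in a key-value or full-text store, each band key could be indexed as a term so that candidate lookup becomes an ordinary query.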
13. Resources
● Finding similar items using minhashing
http://www.toao.com/posts/finding-similar-items-key-store-minhashing.html
● Mining of Massive Datasets
http://infolab.stanford.edu/~ullman/mmds.html