Probabilistic and approximate data structures can provide scalable solutions when exact answers are not required. They trade accuracy for speed and efficiency. Approaches like sampling, hashing, cardinality estimation, and probabilistic databases allow analyzing large datasets while controlling error rates. Example techniques discussed include Bloom filters, locality-sensitive hashing, count-min sketches, HyperLogLog, and feature hashing for machine learning. The talk provided code examples and comparisons of these probabilistic methods.
2. Probabilistic||Approximate: Why?
Often:
● an approximate answer is sufficient
● need to trade accuracy for scalability or speed
● need to analyse a stream of data
Catch:
● despite typically achieving good results, there is a
chance of bad worst-case behaviour
● works best on large datasets (law of large numbers)
3. Code: Approximation
import random

def average(xs):
    return sum(xs) / len(xs)

x = [random.randint(0, 80000) for _ in range(10000)]
y = [i >> 8 for i in x]  # trim the low 8 bits off each integer
z = x[:500]              # 5% sample (x is uniform, so a plain slice is unbiased)

avx = average(x)
avy = average(y) * 2**8  # shift the 8 bits back
avz = average(z)

print(avx)
print(avy, 'error %.06f%%' % (100 * abs(avx - avy) / avx))
print(avz, 'error %.06f%%' % (100 * abs(avx - avz) / avx))
Sample output:
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%
5. Probabilistic Data Structures
Generally they:
● use less space than the full dataset
● require a higher CPU load
● are stream-friendly
● can be parallelized
● have a controlled error rate
6. Hash functions
One-way function:
a key of arbitrary length ->
a message of fixed length
message = hash(key)
However, collisions are possible:
hash(key1) = hash(key2) for key1 ≠ key2
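A minimal sketch in Python using the standard hashlib module; the digest is truncated to a single byte here purely so that a collision is easy to observe (the helper name is illustrative):

import hashlib

def h8(key):
    # Hash a key of arbitrary length down to a fixed 8-bit message.
    return hashlib.sha1(key.encode()).digest()[0]

print(h8('a'), h8('a much, much longer key'))  # always one byte, 0..255

# With only 256 possible messages, a collision turns up quickly:
seen = {}
for i in range(1000):
    d = h8(str(i))
    if d in seen:
        print('collision: hash(%r) == hash(%r) == %d' % (seen[d], str(i), d))
        break
    seen[d] = str(i)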
11. Comparison: Locality Sensitive Hashing (LSH)
Image hashes
Kernelized locality-sensitive hashing for scalable image search
B. Kulis, K. Grauman; IEEE International Conference on Computer Vision (ICCV), 2009
Abstract (excerpt): Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be …
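The kernelized variant from the paper is out of scope here, but a minimal random-hyperplane LSH sketch (the plain cosine-similarity flavour, not the paper's method) shows the core idea: similar vectors get similar bit signatures, so they land in the same Hamming-space buckets. Names and parameters below are illustrative:

import numpy as np

def lsh_signature(v, planes):
    # One bit per hyperplane: which side of the plane does v fall on?
    return tuple(int(np.dot(v, p) >= 0) for p in planes)

rng = np.random.default_rng(0)
dim, n_bits = 128, 16
planes = rng.standard_normal((n_bits, dim))   # random hyperplanes

a = rng.standard_normal(dim)
b = a + 0.1 * rng.standard_normal(dim)        # a near-duplicate of a
c = rng.standard_normal(dim)                  # an unrelated vector

sig = lambda v: lsh_signature(v, planes)
print(sum(p != q for p, q in zip(sig(a), sig(b))))  # few differing bits
print(sum(p != q for p, q in zip(sig(a), sig(c))))  # ~half the bits differ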
12. Membership test: Bloom filter
A Bloom filter is probabilistic, but it yields only false positives, never false negatives.
Hash each item k times to get k indices into a bit field of bits 1..m, and set those bits.
To test an item w, hash it the same way:
● at least one 0 means w definitely isn't in the set;
● all 1s mean w probably is in the set.
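A minimal Bloom filter sketch in Python, deriving the k hash functions by salting one SHA-1 hash (an assumption of this sketch, not the only construction):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0   # bit field kept in one int

    def _indices(self, item):
        # k indices into the bit field, one per salted hash.
        for i in range(self.k):
            h = hashlib.sha1(b'%d:%s' % (i, item.encode())).digest()
            yield int.from_bytes(h[:8], 'big') % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits |= 1 << idx

    def __contains__(self, item):
        # Any 0 bit -> definitely absent; all 1 bits -> probably present.
        return all(self.bits >> idx & 1 for idx in self._indices(item))

bf = BloomFilter()
bf.add('cat')
print('cat' in bf, 'dog' in bf)   # True False (barring a false positive)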
15. Use Bloom filter to store graphs
False positives can only add spurious nodes to the
stored graph; real nodes are never lost.
Pell et al., PNAS 2012
16. Counting Distinct Elements
In:
an infinite stream of data
Question: how many distinct elements are there?
This is similar to:
In:
a series of coin flips
Question: how many times has the coin been flipped?
17. Coin flips: intuition
● Long runs of HEADs in a random series are rare.
● The longer you look, the more likely you are to see a long one.
● The length of the longest run you have seen is therefore
correlated with how many times you have flipped.
19. Cardinality estimation
Basic algorithm:
● n = 0
● For each input item:
○ hash the item into a bit string
○ count the trailing zeroes in that bit string
○ if this count > n, let n = count
● Estimated cardinality (“count distinct”) = 2^n
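A minimal sketch of this basic estimator (Flajolet-Martin style) in Python; a single estimator like this has high variance, which is exactly what HyperLogLog's many registers correct:

import hashlib

def trailing_zeroes(n):
    # Trailing zero bits of n; treated as 0 for n == 0 to keep the sketch short.
    return (n & -n).bit_length() - 1 if n else 0

def estimate_cardinality(stream):
    n = 0
    for item in stream:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], 'big')
        n = max(n, trailing_zeroes(h))
    return 2 ** n

# 10,000 distinct items; expect only the right order of magnitude.
print(estimate_cardinality(str(i) for i in range(10000)))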
20. Cardinality estimation: HyperLogLog
Demo by: http://www.aggregateknowledge.com/science/blog/hll.html
Billions of distinct values in 1.5 KB of RAM with 2% relative error.
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier; “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”, 2007
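A minimal HyperLogLog sketch following the 2007 paper, with the small- and large-range corrections omitted for brevity (so it is only reasonable well away from the empty and saturated ends):

import hashlib

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b                     # first b hash bits pick a register
        self.m = 1 << b                # number of registers
        self.M = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, m >= 128

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], 'big')
        idx = h >> (64 - self.b)                      # register index
        rest = h & ((1 << (64 - self.b)) - 1)         # remaining bits
        rank = (64 - self.b) - rest.bit_length() + 1  # leading zeroes + 1
        self.M[idx] = max(self.M[idx], rank)

    def count(self):
        # Harmonic mean of the register estimates, bias-corrected.
        return self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.M)

hll = HyperLogLog()
for i in range(100000):
    hll.add(str(i))
print(int(hll.count()))   # ~100000; typical error 1.04/sqrt(m), ~3% here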
24. Machine Learning: Feature hashing
High-dimensional machine learning without a feature dictionary.
From Andrew Clegg, “Approximate methods for scalable data mining”.
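A minimal feature-hashing sketch in Python: tokens are mapped straight to vector indices by a hash, so no feature dictionary is ever built. The helper name and sizes are illustrative; scikit-learn's sklearn.feature_extraction.FeatureHasher is a production implementation of the same idea:

import hashlib
import numpy as np

def hash_features(tokens, n_features=2**10):
    # Map a token list to a fixed-size vector with no feature dictionary.
    vec = np.zeros(n_features)
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % n_features             # low bits choose the index
        sign = 1 - 2 * ((h >> 127) & 1)  # a high bit gives a sign, reducing collision bias
        vec[idx] += sign
    return vec

print(hash_features('the quick brown fox'.split()).shape)  # (1024,) for any vocabulary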
29. References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html
30. Summary
● know the data structures
● know what you sacrifice
● control errors
http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov